KR102404947B1

KR102404947B1 - Method and apparatus for machine learning based identification of structural variants in cancer genomes

Info

Publication number: KR102404947B1
Application number: KR1020210123291A
Authority: KR
Inventors: 주영석; 박한솔; 박성열
Original assignee: 주식회사 지놈인사이트
Priority date: 2020-09-17
Filing date: 2021-09-15
Publication date: 2022-06-10
Also published as: KR20220037376A

Abstract

A method for identifying genomic structural variations, performed by a computing device according to an embodiment of the present invention, includes: obtaining structural variation candidates identified from whole-genome data composed of pairs of cancer genome data and normal-tissue genome data; extracting features of each structural variation candidate based on data located within a predetermined range of that candidate in the whole-genome data; labeling each structural variation candidate with classification information based on a list of known structural variations; and training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information. The machine learning model may receive a structural variation candidate to be identified as input and output a classification of that candidate.

Description

Method and apparatus for machine learning-based identification of genomic structural variations

The present invention relates to the identification of genomic structural variations.

Next-generation sequencing (NGS) is a high-throughput genome analysis method that reads a genome as a very large number of fragments and assembles the resulting sequence fragments to determine the genome sequence. Whole-genome sequencing (WGS), made possible by next-generation sequencing, is useful for detecting almost all types of somatic mutations. Because of this usefulness, whole-genome analysis is widely performed and plays a particularly important role in cancer genomics.

Structural variants (SVs) play an important role in cancer development, so many bioinformatics algorithms and tools have been developed to detect somatic mutations in cancer genomes. However, to achieve high sensitivity, structural variation detection tools inevitably include a considerable number of false positives (FPs) in their output. A follow-up step is therefore needed that removes the false positives from the structural variation candidates called by the detection tool and produces an accurate list of true structural variations (true positives).

To remove false positives, however, experts have had to set optimal filtering conditions through time-consuming and labor-intensive heuristic trial and error. Because false-positive removal depends on human skill, it introduces inter-individual and inter-laboratory variability. There are also no standardized filtering rules applicable across cancer genome studies, which makes it difficult to maintain the quality of bioinformatic interpretation and limits scalability. The analysis also takes considerable time: in the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium, top experts spent several years freezing the final mutation call set in order to annotate 2,658 cancer whole genomes. In particular, given the cost of producing genomic data, finding structural variations with existing methods has practical limits in large-scale cancer genome studies and clinical settings.

Korean Patent Application Publication No. 10-2019-0036494 (2019.04.04.)

The present disclosure provides a method and apparatus for identifying genomic structural variations that detect various structural variations in genomic data using a machine learning model.

The present disclosure provides a method and apparatus that extract features of structural variation candidates found by a structural variation detection tool and identify the candidates using a machine learning model trained on the features of positive structural variations (true positives) and negative structural variations (false positives).

A method for identifying genomic structural variations, performed by a computing device according to an embodiment of the present invention, may include: obtaining structural variation candidates identified from whole-genome data composed of pairs of cancer genome data and normal-tissue genome data; extracting features of each structural variation candidate based on data located within a predetermined range of that candidate in the whole-genome data; labeling each structural variation candidate with classification information based on a list of known structural variations; and training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information. In an embodiment, the machine learning model may receive a structural variation candidate to be identified as input and output a classification of that candidate.

In an embodiment, the method for identifying genomic structural variations may include, before the extracting of the features of each structural variation candidate, correcting an estimated position of a first structural variation candidate among the structural variation candidates based on supplementary alignment (SA) tag information associated with the first structural variation candidate.

In an embodiment, the correcting of the estimated position of the first structural variation candidate may include identifying a first region and a second region on a reference sequence to which the first structural variation candidate is mapped, wherein the first region and the second region are not contiguous with each other.

In an embodiment, the correcting of the estimated position of the first structural variation candidate may include determining a position of a breakpoint associated with the first structural variation candidate.
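As a non-limiting illustration of the SA-tag information referred to above, the following sketch parses the value of a SAM `SA:Z` tag (entries of the form `rname,pos,strand,CIGAR,mapQ,NM;`) and derives a breakpoint coordinate from a clipped alignment. The function names, the `min_gap` threshold, and the rule used to pick the breakpoint are assumptions made for this example; the disclosure does not specify how the correction is computed.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SupplementaryAlignment:
    chrom: str
    pos: int          # 1-based leftmost mapped position
    strand: str       # "+" or "-"
    cigar: str
    mapq: int
    edit_distance: int


def parse_sa_tag(sa_value: str) -> List[SupplementaryAlignment]:
    """Parse the value of an SA:Z tag (without the 'SA:Z:' prefix)."""
    alignments = []
    for entry in sa_value.strip().split(";"):
        if not entry:
            continue  # the tag value ends with a trailing ';'
        chrom, pos, strand, cigar, mapq, nm = entry.split(",")
        alignments.append(
            SupplementaryAlignment(chrom, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return alignments


def is_discontiguous(primary_chrom: str, primary_pos: int,
                     sa: SupplementaryAlignment, min_gap: int = 1000) -> bool:
    """Hypothetical test that two mapped regions are not contiguous.

    min_gap is an illustrative threshold, not a value from the disclosure.
    """
    return sa.chrom != primary_chrom or abs(sa.pos - primary_pos) >= min_gap


_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_breakpoint(pos: int, cigar: str) -> int:
    """Return the reference coordinate next to the clipped end of an alignment.

    Simplification: if the alignment starts with a clip, the breakpoint is the
    first aligned base; otherwise it is the last aligned base.
    """
    ops = [(int(n), op) for n, op in _CIGAR_RE.findall(cigar)]
    if ops and ops[0][1] in "SH":
        return pos
    ref_len = sum(n for n, op in ops if op in "MDN=X")  # reference-consuming ops
    return pos + ref_len - 1
```

In this sketch, a split read whose primary alignment and supplementary alignment fall in discontiguous regions yields one clipped-end coordinate per region, which together could serve as the corrected breakpoint pair.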

In an embodiment, the extracting of the features of each structural variation candidate based on data located within a predetermined range of that candidate may include obtaining, from the whole-genome data, data extending at least a first length from a breakpoint associated with a first structural variation candidate among the structural variation candidates, in the direction in which the variant-supporting reads of the first structural variation candidate are located.

In an embodiment, the first length may be the average insert size of the whole-genome data.

In an embodiment, the features of the structural variation candidate may include at least one of: the number of variant-supporting reads; the number of split reads; the number of split reads with supplementary alignment (SA) tags; and the number of reads in the normal-tissue genome data that have the same clipped sequence as the split reads.

In an embodiment, the extracting of the features of each structural variation candidate based on data located within a predetermined range of that candidate may include obtaining, from the whole-genome data, data within a second length from a breakpoint associated with a first structural variation candidate among the structural variation candidates, in the direction opposite to the variant-supporting reads of the first structural variation candidate.

In an embodiment, the second length may be 200 bp (base pairs).
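As a non-limiting illustration, the region examined around a breakpoint can be sketched as a coordinate window that extends a first length toward the variant-supporting reads and a second length (200 bp in the embodiment above) in the opposite direction. The function name, the `read_side` argument, and the use of exactly the first length (the text says "at least") are assumptions made for this example.

```python
from typing import Tuple


def feature_window(breakpoint: int, read_side: str,
                   first_length: int, second_length: int = 200) -> Tuple[int, int]:
    """Return a 1-based inclusive (start, end) window around a breakpoint.

    read_side: "left" if the variant-supporting reads lie at lower coordinates
               than the breakpoint, "right" if they lie at higher coordinates.
    first_length: extent on the read side, e.g. the average insert size.
    second_length: extent on the opposite side (200 bp in the embodiment).
    """
    if read_side not in ("left", "right"):
        raise ValueError("read_side must be 'left' or 'right'")
    if read_side == "left":
        start = breakpoint - first_length
        end = breakpoint + second_length
    else:
        start = breakpoint - second_length
        end = breakpoint + first_length
    return max(1, start), end  # clamp at the first base of the chromosome
```

For example, with an average insert size of 350 bp and supporting reads to the left of a breakpoint at position 10,000, the window would span positions 9,650 to 10,200.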

In an embodiment, the labeling of the classification information may include either of: labeling a first structural variation candidate among the structural variation candidates as positive, the first structural variation candidate being one whose position is marked as positive in the list of known structural variations; and labeling a second structural variation candidate among the structural variation candidates as negative, the second structural variation candidate being one whose position is marked as negative in the list of known structural variations.
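As a non-limiting illustration, the position-based labeling can be sketched as a lookup of each candidate's breakpoint pair in a list of known structural variations. The data layout, the function names, and the positional `tolerance` are assumptions made for this example; the disclosure only states that the label follows how the candidate's position is marked in the known list.

```python
from typing import List, Optional, Tuple

Breakend = Tuple[str, int]               # (chromosome, position)
Candidate = Tuple[Breakend, Breakend]    # the two breakpoints of one candidate
KnownEntry = Tuple[Candidate, bool]      # (known variation, True=positive / False=negative)


def _same_site(a: Breakend, b: Breakend, tolerance: int) -> bool:
    """Two breakends match if they share a chromosome and lie within tolerance."""
    return a[0] == b[0] and abs(a[1] - b[1]) <= tolerance


def label_candidate(candidate: Candidate, known_list: List[KnownEntry],
                    tolerance: int = 10) -> Optional[bool]:
    """Return True (positive), False (negative), or None if no known entry matches.

    tolerance (in bp) is a hypothetical parameter for this sketch.
    """
    for known_sv, is_positive in known_list:
        if (_same_site(candidate[0], known_sv[0], tolerance)
                and _same_site(candidate[1], known_sv[1], tolerance)):
            return is_positive
    return None
```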

In an embodiment, the extracting of the features of each structural variation candidate may include extracting, for each structural variation candidate: the number of variant-supporting reads; the number of split reads with a supplementary alignment tag; the number of split reads; the mapping quality; the read depth change; the number of reads classified as background noise; and the number of normal-tissue samples in which the same variant is detected (panel of normal samples).

In an embodiment, the extracting of the features of each structural variation candidate may include extracting, for each structural variation candidate, from clinical data: the tumor histology; a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication); the fraction of cancer cells in the sample (tumor purity); and the ploidy of the cancer cell genome (tumor ploidy).
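As a non-limiting illustration, the read-level and clinical features listed above can be gathered into one record per candidate and flattened into a numeric vector for a model. The class name, field names, and the one-hot encoding of tumor histology are assumptions made for this example and are not prescribed by the disclosure.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SVCandidateFeatures:
    # Read-level features extracted around the candidate (hypothetical names).
    variant_supporting_reads: int
    split_reads_with_sa_tag: int
    split_reads: int
    mapping_quality: float
    read_depth_change: float
    background_noise_reads: int
    panel_of_normals_count: int
    # Sample-level features taken from clinical data.
    tumor_histology: str
    whole_genome_duplication: bool
    tumor_purity: float
    tumor_ploidy: float

    def to_vector(self, histologies: Sequence[str]) -> List[float]:
        """Flatten to numbers; histology is one-hot encoded over `histologies`."""
        numeric = [
            float(self.variant_supporting_reads),
            float(self.split_reads_with_sa_tag),
            float(self.split_reads),
            float(self.mapping_quality),
            float(self.read_depth_change),
            float(self.background_noise_reads),
            float(self.panel_of_normals_count),
            1.0 if self.whole_genome_duplication else 0.0,
            float(self.tumor_purity),
            float(self.tumor_ploidy),
        ]
        one_hot = [1.0 if self.tumor_histology == h else 0.0 for h in histologies]
        return numeric + one_hot
```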

In an embodiment, the machine learning model may be a model that receives the features of the structural variation candidate to be identified and outputs a probability value for classifying that candidate as positive or negative.

In an embodiment, the method for identifying genomic structural variations may further include: verifying the classification performance of the machine learning model using validation samples included in the whole-genome data; and determining, based on the result of the verification, a cutoff value for deciding whether a probability value output by the machine learning model indicates positive or negative.
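As a non-limiting illustration, a cutoff can be chosen by scanning candidate thresholds over the probabilities the model outputs for validation samples and keeping the one with the best score. Using the F1 score as the performance measure, the grid of thresholds, and the function names are assumptions made for this example; the disclosure states only that the cutoff is determined from the classification performance.

```python
from typing import Optional, Sequence


def f1_at_cutoff(probs: Sequence[float], labels: Sequence[bool], cutoff: float) -> float:
    """F1 score when candidates with probability >= cutoff are called positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and not y)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def choose_cutoff(probs: Sequence[float], labels: Sequence[bool],
                  grid: Optional[Sequence[float]] = None) -> float:
    """Return the grid value with the highest F1 (the lowest such value on ties)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01 .. 0.99
    return max(grid, key=lambda c: f1_at_cutoff(probs, labels, c))
```

A different criterion, such as a fixed minimum precision, could be substituted for F1 without changing the overall procedure.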

In an embodiment, the obtaining of the structural variation candidates identified from the whole-genome data may include: inputting the whole-genome data into a structural variation detection tool; and obtaining the structural variation candidates output by the structural variation detection tool.

A method for identifying genomic structural variations, performed by a computing device according to another embodiment of the present invention, may include: obtaining structural variation candidates from whole-genome data to be analyzed; extracting features of each structural variation candidate based on data located within a predetermined range of that candidate in the whole-genome data; and inputting the extracted features of each structural variation candidate into a trained machine learning model to identify each candidate as a negative structural variation or a positive structural variation. In an embodiment, the machine learning model may be a model trained using features of training structural variation candidates obtained from whole-genome data composed of pairs of cancer genome data and normal-tissue genome data, together with classification information of the training structural variation candidates.

In an embodiment, the method for identifying genomic structural variations may further include generating a list of true structural variations by removing, from the obtained structural variation candidates, the candidates identified as negative structural variations.
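As a non-limiting illustration, generating the list of true structural variations amounts to dropping every candidate that the trained model identifies as negative. In this sketch a candidate is treated as negative when its predicted probability falls below a cutoff; the function name and the representation of candidates as arbitrary objects are assumptions made for this example.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def filter_candidates(candidates: Sequence[T], probabilities: Sequence[float],
                      cutoff: float) -> List[T]:
    """Keep candidates whose predicted probability of being positive >= cutoff."""
    if len(candidates) != len(probabilities):
        raise ValueError("one probability is required per candidate")
    return [c for c, p in zip(candidates, probabilities) if p >= cutoff]
```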

A non-transitory computer-readable storage medium according to yet another embodiment of the present invention contains instructions that, when executed by one or more processors of a computing device, cause the computing device to perform operations including: obtaining structural variation candidates identified from whole-genome data composed of pairs of cancer genome data and normal-tissue genome data; extracting features of each structural variation candidate based on data located within a predetermined range of that candidate in the whole-genome data; labeling each structural variation candidate with classification information based on a list of known structural variations; and training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information. The machine learning model may receive a structural variation candidate to be identified as input and output a classification of that candidate.

A computing device according to yet another embodiment of the present invention includes a processor and a memory storing instructions that, when executed by the processor, cause the computing device to perform operations including: obtaining structural variation candidates identified from whole-genome data composed of pairs of cancer genome data and normal-tissue genome data; extracting features of each structural variation candidate based on data located within a predetermined range of that candidate in the whole-genome data; labeling each structural variation candidate with classification information based on a list of known structural variations; and training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information. The machine learning model may receive a structural variation candidate to be identified as input and output a classification of that candidate.

According to an embodiment, various structural variations in a genome can be detected accurately using a machine learning model, and an accurate list of structural variations can be generated by removing false positives, which can contribute to progress in oncology, including cancer diagnosis.

According to an embodiment, false positives can be removed quickly from structural variation candidates by using a machine learning model trained on standardized features of structural variation candidates.

According to an embodiment, using a machine learning model trained on standardized features can greatly reduce the considerable time that experts otherwise spend filtering false positives by setting optimal filtering conditions through manual inspection and heuristic trial and error.

According to an embodiment, standardized structural variation detection using a machine learning model improves scalability, and because manual curation by experts is unnecessary, cost and time can be reduced.

According to an embodiment, the machine learning model can be applied to various cohorts through simple retraining, and an optimal cutoff can be found based on classification performance.

FIG. 1 is a diagram schematically illustrating a conventional method for identifying structural variations.
FIG. 2 is a block diagram of an apparatus for identifying genomic structural variations according to an embodiment.
FIG. 3 is a diagram illustrating a data structure according to an embodiment.
FIGS. 4A and 4B are diagrams illustrating, by way of example, a method of extracting structural variation information from a structural variation candidate region according to an embodiment.
FIG. 5 is a flowchart of a training method for identifying genomic structural variations according to an embodiment.
FIG. 6 is a diagram schematically illustrating structural variation identification using a trained machine learning model according to an embodiment.
FIG. 7 is a flowchart of a method for identifying structural variations using a trained machine learning model according to an embodiment.
FIG. 8 is a hardware configuration diagram of a computing device according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", and "module" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 1 is a diagram schematically illustrating a conventional method for identifying structural variations.

Referring to FIG. 1, tools 10 for detecting structural variants (SVs) in cancer-normal whole-genome sequences include DELLY, BRASS, SvABA, and dRanger. In a structural variation detection tool, a caller calls structural variations from the whole-genome sequence, and the resulting structural variation candidates include not only true positives (TPs) but also a considerable number of false positives (FPs).

A follow-up step is therefore needed that removes the false positives from the called structural variation candidates and produces an accurate list of structural variations. Until now, experts have filtered false positives by setting optimal filtering conditions through manual inspection and heuristic trial and error, which takes considerable time. Moreover, because filtering conditions differ by sample quality and cancer type, even filtering conditions that took considerable time to establish do not give accurate results when applied to a different cohort, and additional time is needed to adjust them for that cohort. Furthermore, because many conditions interact, it is difficult to find an optimal cutoff, and because the filtering conditions are based on manual inspection, the criteria for classifying false positives are difficult to explain.

Meanwhile, with the development of machine learning-based artificial intelligence, machine learning is also being applied to analysis in genomics. For example, Google developed DeepVariant, which uses a deep neural network to call germline variants. Machine learning-based methods for discovering somatic variants such as point mutations have been actively studied over the past few years. Structural variations, also called genomic rearrangements, however, remain understudied. This is because of their complexity: structural variations range from simple deletions, duplications, inversions, insertions, and translocations to large-scale genomic rearrangements.

For example, point mutations and small insertions and deletions (indels) are variants that lie within a read, the basic unit of sequencing data, so their exact position can be specified; it is enough to examine the reads carrying the variant and the reference reads at that position. In contrast, a structural variation is larger than a single read, so variant-supporting reads are not confined to the breakpoint, and the exact position often cannot be determined. As a result, the positions reported by known structural variation detection tools are often inaccurate.

A method and apparatus that use a machine learning model to automatically identify such complex genomic structural variations are described in detail below.

FIG. 2 is a block diagram of an apparatus for identifying genomic structural variations according to an embodiment, and FIG. 3 is a diagram illustrating a data structure according to an embodiment.

Referring to FIG. 2, an apparatus 100 for identifying genomic structural variations (simply, the "apparatus") is a computing device operated by at least one processor. It includes a feature extractor 110 that extracts features of the structural variation candidates output by the structural variation detection tool 10, an annotator 130 that labels each structural variation candidate as positive (True) or negative (False), and a trainer 150 that trains a machine learning model 200 using a dataset in which the features are labeled. The machine learning model 200 may be implemented as any of various models, such as a logistic regression model, a random forest model, or a stochastic gradient boosting model.
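As a non-limiting illustration of what the trainer 150 might do with the labeled dataset, the following sketch fits a logistic regression model, one of the model types named above, by plain batch gradient descent using only the Python standard library. The function names, learning rate, and number of epochs are assumptions made for this example; a practical implementation would more likely rely on an established machine learning library and could equally use a random forest or gradient boosting model.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability that a candidate with feature vector x is a true variation."""
    return _sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train_logistic_regression(X: Sequence[Sequence[float]], y: Sequence[int],
                              lr: float = 0.5, epochs: int = 1000
                              ) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss.

    X holds one feature vector per candidate; y holds 1 (positive) or 0 (negative).
    lr and epochs are illustrative defaults, not values from the disclosure.
    """
    n, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

The returned parameters can then be passed to `predict_proba` to score a new candidate, which matches the role of the machine learning model 200 as a component that maps features to a positive/negative probability.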

The apparatus 100 includes at least one processor, and the processor performs the operations of the present disclosure by executing instructions contained in a computer program. The computer program contains instructions written so that the processor carries out the operations of the present disclosure, and may be stored in a non-transitory computer-readable storage medium.

Here, the feature extractor 110, the annotator 130, the trainer 150, and the machine learning model 200 may be implemented as a computer program stored in a computer-readable storage medium, and the apparatus 100 may operate them by executing the instructions contained in the computer program. The feature extractor 110, the annotator 130, the trainer 150, the trained machine learning model 200, and the filtering unit 170 described with reference to FIG. 6 may be produced as a computer program that is downloaded over a network or sold as a product, and may be installed on computing devices at various sites such as research institutes and hospitals.

The structural variation detection tool 10 is a tool in which a caller calls structural variations from whole-genome data and outputs structural variation candidates; it is assumed to output the called candidates to the apparatus 100. In the present disclosure, the structural variation detection tool 10 is assumed to be a known tool.

The cancer-normal whole-genome data 20 includes cancer-normal samples taken from the same individual. Each sample consists of a pair of cancer genome data and normal tissue genome data, and may be sequenced as short reads produced by a sequencer such as an Illumina instrument. The cancer-normal whole-genome data 20 may include samples of various cancer types.

Referring to FIG. 3, the cancer-normal whole-genome data 20 constructed in the present disclosure includes a total of 1,808 samples, which may be divided into training samples and validation samples. The training samples used in the present disclosure include a total of 1,212 cancer genomes, curated by the International Cancer Genome Consortium, that correspond to 35 tumor types from 20 tumor tissues. The validation samples include 499 cancer genomes from the same cohort and 97 cancer genomes from an independent study. The composition of the samples may be adjusted in various ways.

The feature extraction unit 110 receives the structural variation candidates found by the structural variation search tool 10 in the whole-genome sequencing (WGS) data, and extracts features of each candidate by examining the region around that candidate in the paired cancer genome data and normal tissue genome data. The features may also include some features extracted from clinical data. The feature extraction unit 110 stores the features of each structural variation candidate in the dataset 40 for machine learning.

Unlike point mutations or short insertions and deletions, whose position is determined within a single read, the exact position of a structural variation is often unknown. Because the positions of the candidates called by the structural variation search tool 10 may therefore be inaccurate, the feature extraction unit 110 may correct them to exact coordinates using the supplementary alignment tag (SA tag), and may merge candidates that have the same strand orientation (e.g., 5' to 3') and lie within 500 bases of each other.
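The merging rule described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Candidate` record, the function names, and the choice to keep the first candidate of each group as its representative are assumptions. Only the 500-base threshold and the same-orientation condition come from the text.

```python
from dataclasses import dataclass

MERGE_DISTANCE = 500  # bases; the threshold named in the text


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one structural variation candidate."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def can_merge(a, b, max_dist=MERGE_DISTANCE):
    """True if both breakpoints share chromosome and orientation and lie within max_dist."""
    same_chroms = a.chrom1 == b.chrom1 and a.chrom2 == b.chrom2
    same_orientation = a.strand1 == b.strand1 and a.strand2 == b.strand2
    close = abs(a.pos1 - b.pos1) <= max_dist and abs(a.pos2 - b.pos2) <= max_dist
    return same_chroms and same_orientation and close


def merge_candidates(candidates):
    """Greedy merge: keep the first candidate of each group (an assumption)."""
    merged = []
    ordered = sorted(candidates, key=lambda c: (c.chrom1, c.pos1, c.chrom2, c.pos2))
    for cand in ordered:
        for kept in merged:
            if can_merge(kept, cand):
                break  # cand is absorbed by an already kept representative
        else:
            merged.append(cand)
    return merged
```

Each candidate is compared only against the representatives already kept, so a chain of candidates that are each within 500 bases of their neighbor is not necessarily collapsed into one group. A different grouping policy would be equally compatible with the text.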

Here, the supplementary alignment tag may be one of the data items output when an alignment tool maps each read to a reference sequence. The supplementary alignment tag of a read may be information indicating another position on the reference sequence to which the read can be mapped, in addition to the primary position to which it is mapped. For example, if a read is a split read located at a breakpoint, that is, a point where a structural variation begins or ends (such as read 315 and points 311 and 312 in FIG. 4B), the read may be mapped to two different positions or regions of the reference sequence that are not contiguous with each other. Specifically, most of a read located at a breakpoint (e.g., reference numeral 317 in FIG. 4B) may be mapped to a first region of the reference sequence, while the remainder of the read (e.g., reference numeral 316 in FIG. 4B) is mapped to a second region that is not contiguous with the first region. In this case, the alignment tool may determine that the read maps to the first region of the reference sequence, and may record and output information about the second region, to which the remainder of the read (e.g., reference numeral 316 in FIG. 4B) maps, as the supplementary alignment tag (SA tag).
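For readers unfamiliar with the tag, the SAM format stores supplementary alignments in the `SA:Z:` field as semicolon-terminated entries of the form `rname,pos,strand,CIGAR,mapQ,NM`. The following sketch parses that field. The function and class names are hypothetical, and the source does not describe how the feature extraction unit 110 actually reads the tag.

```python
from typing import NamedTuple


class SupplementaryAlignment(NamedTuple):
    rname: str   # reference sequence name of the other mapped region
    pos: int     # 1-based leftmost mapping position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance


def parse_sa_tag(value):
    """Parse the text after 'SA:Z:' into a list of SupplementaryAlignment."""
    alignments = []
    for entry in value.split(";"):
        if not entry:
            continue  # the field ends with a trailing semicolon
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        alignments.append(
            SupplementaryAlignment(rname, int(pos), strand, cigar, int(mapq), int(nm))
        )
    return alignments
```

For a split read such as read 315, the parsed `rname` and `pos` would point at the second region, which is the information the text uses to refine breakpoint coordinates.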

By additionally using the supplementary alignment tag, the feature extraction unit 110 can determine more accurately whether a given read is a split read located at a breakpoint where a structural variation begins or ends. This allows the breakpoint positions of the structural variation candidates to be determined more accurately, and thereby the range of the region from which features are extracted to be optimized.

Referring again to FIG. 3, when the structural variation search tool 10 finds a plurality of structural variation candidates for each sample, each candidate may be represented as a junction between two breakpoints (breakpoint1 and breakpoint2) in the genome data. Here, the two breakpoints are ordered by chromosome and position.
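One possible way to order the two breakpoints of a junction by chromosome and position is sketched below. The chromosome ranking (1 to 22, then X, then Y, with unknown contigs last) and all names are assumptions made for illustration; the source only states that the breakpoints are ordered.

```python
# Hypothetical chromosome ranking: autosomes 1-22, then X and Y.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24})


def chrom_key(chrom):
    """Rank a chromosome name, accepting both '7' and 'chr7' styles."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    return CHROM_ORDER.get(name, 99)  # unknown contigs sort last


def order_breakpoints(bp_a, bp_b):
    """Return (breakpoint1, breakpoint2), each a (chromosome, position) tuple."""
    return tuple(sorted((bp_a, bp_b), key=lambda bp: (chrom_key(bp[0]), bp[1])))
```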

The feature extraction unit 110 examines the breakpoints (breakpoint1 and breakpoint2) of each structural variation candidate in the cancer genome data and the normal tissue genome data, and extracts the features in which the cancer genome data differs from the normal tissue genome data.

A true structural variation is characterized by a sufficient number of variant-supporting reads and by many split reads, that is, reads divided at a breakpoint. In contrast, short deletions or short inversions that are also found in the normal sample cause errors and are called as false-positive structural variations. False-positive structural variations are also found in noisy regions characterized by high read depth and low mapping quality. To capture these characteristics, the feature extraction unit 110 extracts the number of variant-supporting reads (features 13 and 14 in Table 1), the number of split reads (features 16 and 17 in Table 1), the number of split reads with a supplementary alignment tag (feature 18 in Table 1), the mapping quality (features 22 and 23 in Table 1), the read depth change (features 24 and 25 in Table 1), the number of reads classified as background noise (features 33, 34, 37, and 38 in Table 1), the number of normal tissue samples in which the same variation was detected, i.e., the panel of normal samples (feature 45 in Table 1), and the like.

Specifically, the feature extraction unit 110 may store the 45 features defined in Table 1 and extract the value of each feature using user-defined functions set up to find it. Of the 45 features, tumor histology, whole-genome duplication (whether the genome has undergone whole-genome duplication), tumor purity (the fraction of cancer cells in the sample), and tumor ploidy (the ploidy of the cancer cell genome) may be extracted from the clinical data of the sample. For the remaining 41 features, the feature extraction unit 110 examines the region around each structural variation candidate in the whole-genome data and extracts the corresponding value (a binary value, count, p-value, frequency, quality, ratio, distance, and so on).

Table 1
No. | Feature | Description
1 | tumor histology | Tumor tissue (20 categories): Biliary, Bladder, Bone_SoftTissue, Breast, Cervix, CNS, Colon_Rectum, Esophagus, Head_Neck, Hematologic, Kidney, Liver, Lung, Ovary, Pancreas, Prostate, Skin, Stomach, Thyroid, Uterus
2 | whole-genome duplication | Whether the genome has undergone whole-genome duplication (binary value): TRUE, FALSE
3 | structural variant type | Type of structural variation (4 categories): deletion, duplication, inversion, translocation
4 | positional variation at breakpoint1 | Whether the mapping positions of variant-supporting reads vary at breakpoint 1 in the cancer genome data (binary value): TRUE, FALSE
5 | positional variation at breakpoint2 | Whether the mapping positions of variant-supporting reads vary at breakpoint 2 in the cancer genome data (binary value): TRUE, FALSE
6 | microhomology | Presence of short homologous DNA sequences (microhomologous sequences) at the two breakpoints (binary value): TRUE, FALSE
7 | tumor purity | Fraction of cancer cells in the cancer genome data
8 | tumor ploidy | Ploidy of the cancer cell genome
9 | reference read count at breakpoint1 in tumor WGS | Number of reference reads at breakpoint 1 in the cancer genome data
10 | reference read count at breakpoint2 in tumor WGS | Number of reference reads at breakpoint 2 in the cancer genome data
11 | reference read count at breakpoint1 in normal WGS | Number of reference reads at breakpoint 1 in the normal tissue genome data
12 | reference read count at breakpoint2 in normal WGS | Number of reference reads at breakpoint 2 in the normal tissue genome data
13 | total variant-supporting read count in tumor WGS | Number of variant-supporting reads in the cancer genome data
14 | total variant-supporting read count in normal WGS | Number of variant-supporting reads in the normal tissue genome data
15 | significance of variant fraction | p-value of Fisher's exact test on the 2x2 table of reference read counts and variant-supporting read counts in the cancer and normal tissue genome data
16 | split read count in tumor WGS | Number of split reads in the cancer genome data
17 | split read count in normal WGS | Number of split reads in the normal tissue genome data
18 | count of split reads with supplementary alignment tag in tumor WGS | Number of reads with a supplementary alignment tag in the cancer genome data
19 | same clipped read count in normal WGS | Number of reads in the normal tissue genome data that have the same clipped sequence as the split reads
20 | variant allele fraction at breakpoint1 in tumor WGS | Fraction of variant-supporting reads at breakpoint 1 in the cancer genome data
21 | variant allele fraction at breakpoint2 in tumor WGS | Fraction of variant-supporting reads at breakpoint 2 in the cancer genome data
22 | median mapping quality at breakpoint1 in tumor WGS | Median mapping quality of variant-supporting reads at breakpoint 1 in the cancer genome data
23 | median mapping quality at breakpoint2 in tumor WGS | Median mapping quality of variant-supporting reads at breakpoint 2 in the cancer genome data
24 | depth ratio change at breakpoint1 in tumor WGS | Read depth change at breakpoint 1 in the cancer genome data
25 | depth ratio change at breakpoint2 in tumor WGS | Read depth change at breakpoint 2 in the cancer genome data
26 | new mate count at breakpoint1 in tumor WGS | Number of new structural variations extracted from the supplementary alignment tags of variant-supporting reads at breakpoint 1 in the cancer genome data
27 | new mate count at breakpoint2 in tumor WGS | Number of new structural variations extracted from the supplementary alignment tags of variant-supporting reads at breakpoint 2 in the cancer genome data
28 | neo mate count at breakpoint1 in tumor WGS | Number of structural variations inferred from the reads counted in feature 26
29 | neo mate count at breakpoint2 in tumor WGS | Number of structural variations inferred from the reads counted in feature 27
30 | distance between breakpoint1 and 2 | Distance between breakpoint 1 and breakpoint 2; set to the size of chromosome 1 when the structural variation is a translocation
31 | total clipped read count at breakpoint1 in tumor WGS | Number of all clipped reads at breakpoint 1 in the cancer genome data
32 | total clipped read count at breakpoint2 in tumor WGS | Number of all clipped reads at breakpoint 2 in the cancer genome data
33 | total clipped read count at breakpoint1 in normal WGS | Number of all clipped reads at breakpoint 1 in the normal tissue genome data
34 | total clipped read count at breakpoint2 in normal WGS | Number of all clipped reads at breakpoint 2 in the normal tissue genome data
35 | other discordant read pair cluster at breakpoint1 in tumor WGS | Number of discordant read pair clusters other than the variant-supporting reads at breakpoint 1 in the cancer genome data
36 | other discordant read pair cluster at breakpoint2 in tumor WGS | Number of discordant read pair clusters other than the variant-supporting reads at breakpoint 2 in the cancer genome data
37 | other discordant read pair cluster at breakpoint1 in normal WGS | Number of discordant read pair clusters other than the variant-supporting reads at breakpoint 1 in the normal tissue genome data
38 | other discordant read pair cluster at breakpoint2 in normal WGS | Number of discordant read pair clusters other than the variant-supporting reads at breakpoint 2 in the normal tissue genome data
39 | read depth at breakpoint1 in normal WGS | Read depth at breakpoint 1 in the normal tissue genome data
40 | read depth at breakpoint2 in normal WGS | Read depth at breakpoint 2 in the normal tissue genome data
41 | GC content at breakpoint1 | Fraction of G and C bases within ±50 bp of breakpoint 1
42 | GC content at breakpoint2 | Fraction of G and C bases within ±50 bp of breakpoint 2
43 | soft-masked base at breakpoint1 | Fraction of soft-masked bases within ±50 bp of breakpoint 1
44 | soft-masked base at breakpoint2 | Fraction of soft-masked bases within ±50 bp of breakpoint 2
45 | panel of normal samples | Number of samples, among the normal tissue genome data in the training data, in which the same variation was detected
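Feature 15 is defined as the p-value of Fisher's exact test on a 2x2 table of read counts. The sketch below computes a two-sided p-value from the hypergeometric distribution using only the standard library. The function name, the row and column layout (rows: cancer, normal; columns: variant-supporting, reference), and the two-sided convention are assumptions, since the source does not say which alternative hypothesis is used.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Assumed layout: a, b = variant-supporting and reference reads in cancer;
    c, d = variant-supporting and reference reads in normal tissue.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant-supporting reads in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum every table that is at least as unlikely as the observed one.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

A small p-value indicates that the variant-supporting fraction differs between the cancer and normal data, which is the pattern expected for a somatic structural variation.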

In Table 1, tumor histology, whole-genome duplication, structural variant type, positional variation at breakpoint1 and at breakpoint2 (whether the mapping positions of variant-supporting reads vary), and microhomology (presence of short homologous DNA sequences at the two breakpoints) are categorical variables, which may be entered as numbers converted using one-hot encoding. The remaining 39 features are continuous variables, for which the number corresponding to each feature may be entered.
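As a brief illustration of the one-hot encoding mentioned above, the following sketch encodes the structural variant type (feature 3). The category order and the helper name are assumptions chosen for the example.

```python
# Category order is an assumption; Table 1 lists the four types in this order.
SV_TYPES = ("deletion", "duplication", "inversion", "translocation")


def one_hot(value, categories):
    """Return a 0/1 vector with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]
```

For example, `one_hot("inversion", SV_TYPES)` gives `[0, 0, 1, 0]`, and a binary feature such as microhomology could be encoded the same way with the categories `("TRUE", "FALSE")`.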

Feature 45 in Table 1, the panel of normal samples (the number of normal tissue samples in which the same variation was detected), is extracted from the full set of normal tissue genome data included in the training cohort. The more data there is, the more normal samples are available against which to check whether a structural variation is real, so a larger data set yields better performance.

Meanwhile, because structural variation detection is imprecise, the extracted features may not represent the structural variation well. To address this, the feature extraction unit 110 uses the supplementary alignment tag to determine more accurately which split reads lie at a breakpoint where a structural variation begins or ends. This allows the breakpoint positions of the structural variation candidates to be determined more accurately, and thereby the range of the candidate region from which features are extracted to be optimized.

In addition, because there are a great many structural variation candidate regions, feature extraction takes considerable time. To reduce this, the feature extraction unit 110 may detect noisy regions of the genome data, in which the noise is at or above a reference level, and skip feature extraction for the detected regions. This speeds up feature extraction for the structural variation candidates.

In particular, because a structural variation is longer than a single read, variant-supporting reads are not confined to the breakpoints, and the exact position is often unknown. To address this, the feature extraction unit 110 may examine the region around each structural variation candidate using the user-defined functions set up to extract the features of Table 1.

Here, the search range around a breakpoint may be set differently depending on the type of structural variation or the feature being extracted. In particular, the following features of Table 1 are extracted from the region around the breakpoints, so that structural variations with complex structure can be identified accurately: whether the mapping positions of variant-supporting reads vary (features 4 and 5), the presence of short homologous DNA sequences at the two breakpoints (feature 6), the number of reference reads at the breakpoints (features 9 to 12), the number of variant-supporting reads (features 13 and 14), the number of split reads (features 16 and 17), the number of reads with a supplementary alignment tag (feature 18), the number of reads in the normal tissue genome data with the same clipped sequence as the split reads (feature 19), the median mapping quality of variant-supporting reads at the breakpoints in the cancer genome data (features 22 and 23), the read depth change at the breakpoints in the cancer genome data (features 24 and 25), the number of new structural variations extracted from the supplementary alignment tags of variant-supporting reads at the breakpoints in the cancer genome data and the number of inferred structural variations (features 26 to 29), the number of all clipped reads at the breakpoints (features 31 to 34), the number of discordant read pair clusters other than the variant-supporting reads at the breakpoints (features 35 to 38), and the read depth at the breakpoints in the normal tissue genome data (features 39 and 40).

In some embodiments, particularly when extracting the features relating to the number of variant-supporting reads (features 13 and 14 in Table 1), the number of split reads (features 16 and 17 in Table 1), the number of reads with a supplementary alignment tag (feature 18 in Table 1), and the number of reads in the normal tissue genome data with the same clipped sequence as the split reads (feature 19 in Table 1), the search range around a breakpoint may be set as follows.

First, in the direction in which the variant-supporting reads lie from a breakpoint (e.g., from positions 311 and 312 in FIG. 4B, in the direction labeled "segment"), the search range may be set to the average insert size of the whole-genome sequencing data being processed, or more. This is because a structural variation is longer than a single read, so there may be variant-supporting reads that lie in the segment direction of the two breakpoints without overlapping either breakpoint.

Meanwhile, considering that the data on structural variation candidates output by the structural variation search tool 10 may be inaccurate, it may be desirable to also examine a predetermined range in the direction opposite to the variant-supporting reads (e.g., from positions 311 and 312 in FIG. 4B, in the direction labeled "gap"). Specifically, the search range may be set to, for example, 200 base pairs from the breakpoint in the gap direction. Note that if a range wider than 200 base pairs is examined in the gap direction, the search may pick up data arising from other structural variations or other discordant reads, rather than from the reads that make up the structural variation candidate associated with that breakpoint.
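The asymmetric search range described in the two preceding paragraphs can be sketched as follows. The function name, the encoding of the segment direction as +1 or -1, and the clamping of the start coordinate at 0 are assumptions for illustration. The 200-base-pair gap range and the use of the average insert size in the segment direction come from the text.

```python
GAP_RANGE = 200  # base pairs searched in the gap direction (example value from the text)


def search_window(breakpoint, segment_direction, mean_insert_size, gap_range=GAP_RANGE):
    """Return the (start, end) coordinates to examine around one breakpoint.

    segment_direction is +1 when the variant-supporting reads lie at higher
    coordinates than the breakpoint and -1 when they lie at lower coordinates
    (a hypothetical encoding of the 'segment' direction in FIG. 4B).
    """
    if segment_direction not in (1, -1):
        raise ValueError("segment_direction must be +1 or -1")
    if segment_direction == 1:
        start, end = breakpoint - gap_range, breakpoint + mean_insert_size
    else:
        start, end = breakpoint - mean_insert_size, breakpoint + gap_range
    return max(0, start), end
```

The text allows the segment-direction range to exceed the average insert size, so `mean_insert_size` here should be read as a lower bound that a caller may enlarge.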

Referring again to FIG. 3, the annotation unit 130 labels each structural variation candidate found in the whole-genome data as positive (True) or negative (False) and stores the labels in the dataset 40 for machine learning. Here, the annotation unit 130 may refer to the list 30 of known structural variations and label each candidate as positive if it is present in the list 30 and as negative otherwise.
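The labeling step can be sketched as a set lookup. The tuple layout used as the candidate key and the exact-match rule are assumptions; the source does not state how a candidate is matched against an entry of the list 30 (for example, whether a positional tolerance is allowed).

```python
def label_candidates(candidates, known_variants):
    """Pair each candidate with True if it appears in the known list, else False.

    Candidates and known variants are assumed to be hashable keys such as
    (chrom1, pos1, chrom2, pos2, sv_type); matching is exact in this sketch.
    """
    known = set(known_variants)
    return [(candidate, candidate in known) for candidate in candidates]
```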

The features extracted for each structural variation candidate by the feature extraction unit 110 and the class assigned to each candidate by the annotation unit 130 are stored in the dataset 40, as shown in FIG. 3.

The learning unit 150 trains the machine learning model 200 using the dataset 40. The learning unit 150 trains the machine learning model 200 to classify each structural variation candidate into a class from that candidate's features. The machine learning model 200 may be implemented as any of various models, such as a logistic regression model, a random forest model, or a stochastic gradient boosting model.

The learning unit 150 may validate the performance of the machine learning model 200 using the validation samples included in the cancer-normal whole-genome data 20 and then complete the training. The learning unit 150 may validate the performance using a precision-recall curve.
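For reference, the points of a precision-recall curve can be computed from validation labels and model scores as in the sketch below. The function name and the convention that a score at or above the threshold counts as a positive prediction are assumptions; the source does not describe how the learning unit 150 computes the curve.

```python
def precision_recall_points(labels, scores):
    """Return (threshold, precision, recall) for each distinct score.

    labels are booleans (True for a known structural variation) and scores
    are model outputs; a score >= threshold is treated as a positive call.
    """
    total_positive = sum(labels)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for pred, label in zip(predicted, labels) if pred and label)
        false_pos = sum(1 for pred, label in zip(predicted, labels) if pred and not label)
        # true_pos + false_pos >= 1 because the threshold is one of the scores.
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positive if total_positive else 0.0
        points.append((threshold, precision, recall))
    return points
```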

The trained machine learning model 200 may receive the features of each structural variation candidate as input and output a probability value between 0 and 1. Each candidate is judged positive or negative by comparing its probability value with a set cutoff value. Candidates judged negative may be filtered out as false positives.
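Applying the cutoff can be sketched as follows. The default cutoff of 0.5, the treatment of a probability equal to the cutoff as positive, and the function name are assumptions; the source only says that a cutoff value is set.

```python
def apply_cutoff(probabilities, cutoff=0.5):
    """Split candidate IDs into (kept, filtered) by model probability.

    probabilities maps a candidate ID to the model output in [0, 1].
    The 0.5 default is a placeholder, not a value given in the source.
    """
    kept, filtered = [], []
    for candidate_id, probability in probabilities.items():
        if probability >= cutoff:
            kept.append(candidate_id)      # judged positive
        else:
            filtered.append(candidate_id)  # judged negative, treated as false positive
    return kept, filtered
```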

Using the machine learning model 200 according to the present disclosure, about 5,800 structural variations that previous studies had not found were newly discovered, relative to the 32,134 structural variations reported for 249 test samples.

FIGS. 4A and 4B are diagrams illustrating, by way of example, a method of extracting structural variation information from the region of a structural variation candidate according to an embodiment.

Referring to FIG. 4A, for each structural variation candidate called by the structural variation search tool 10, the feature extraction unit 110 extracts the features defined for distinguishing false positives. In FIG. 4A, the reads of the genome data are colored by mate pair, and reads that are not gray represent discordant read pairs.

For example, when a true-positive structural variation 310 containing a duplication event and a false-positive structural variation 320 are called, the feature extraction unit 110 may extract, as shown in Table 2, the values of the variant-supporting reads in the cancer genome data, split reads, mapping quality, variant-supporting reads in the normal tissue genome data, read depth change, and background noise reads, as well as the values of the other features defined in Table 1. Discordant read pairs may also be counted where needed.

Table 2
Extracted feature | True-positive structural variation | False-positive structural variation
Variant-supporting reads | 7 | 4
Split reads | 4 | 0
Mapping quality | 56 | 20
Variant-supporting reads in normal | 0 | 1
Read depth change | 1.8 | 0.9
Background noise reads | 0 | 8

The feature extraction unit 110 stores the data populated with the features of each structural variation candidate as a dataset 40 for machine learning.

FIG. 5 is a flowchart of a learning method for identifying genomic structural variations according to an embodiment.

Referring to FIG. 5, the device 100 receives, from the structural variation search tool 10, the structural variation candidates detected in the cancer-normal whole genome data 20 (S110).

The device 100 extracts the features of each structural variation candidate by examining the region surrounding that candidate in the cancer-normal whole genome data 20 (S120). For each structural variation candidate, the device 100 examines the vicinity of the two breakpoints (breakpoint1 and breakpoint2) and extracts the value corresponding to each feature. The extracted features include the number of variant-supporting reads, the number of split reads with a supplementary alignment tag, the number of split reads, the mapping quality, the read depth change, the number of reads classified as background noise (background noise reads), the number of samples among the normal tissue genome data in which the same variation was detected (panel of normal samples), and the like. In addition, the device 100 may receive, from the clinical data of the sample, the tumor histology, a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication), the proportion of cancer cells in the sample (tumor purity), and the ploidy of the cancer cell genome (tumor ploidy).
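For illustration only, the features listed above could be held in a per-candidate record such as the following Python sketch. The class name, field names, and types are assumptions made for this example and are not defined in the disclosure. The sample values are the true-positive column of Table 2; fields that Table 2 does not report are left unset.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class SVCandidateFeatures:
    """Hypothetical feature record for one structural variation candidate."""
    # Read-level features examined around the two breakpoints
    variant_supporting_reads: int
    split_reads: int
    mapping_quality: float
    read_depth_change: float
    background_noise_reads: int
    split_reads_with_sa_tag: Optional[int] = None
    panel_of_normal_samples: Optional[int] = None
    # Optional sample-level features taken from clinical data
    tumor_histology: Optional[str] = None
    whole_genome_duplication: Optional[bool] = None
    tumor_purity: Optional[float] = None
    tumor_ploidy: Optional[float] = None

    def to_row(self) -> dict:
        """Flatten the record into one dataset row (feature name -> value)."""
        return asdict(self)


# True-positive example from Table 2; unreported fields stay None.
example_row = SVCandidateFeatures(
    variant_supporting_reads=7,
    split_reads=4,
    mapping_quality=56.0,
    read_depth_change=1.8,
    background_noise_reads=0,
).to_row()
```

Keeping the clinical fields optional mirrors the text, which describes them as additional inputs rather than mandatory ones.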

The device 100 labels each structural variation candidate with a classification class, positive (True) or negative (False), by referring to the known structural variation list 30 (S130). The device 100 may label a structural variation candidate as positive if it has the same orientation as a structural variation in the structural variation list 30 and lies within a distance of 500 bases of it, and as negative otherwise.
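The labeling rule of step S130 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `SV` record, the orientation encoding, and the choice to require both breakpoints to fall within 500 bases of the known variation are assumptions, since the text states only "same orientation" and "within a distance of 500 bases".

```python
from dataclasses import dataclass
from typing import List

MAX_DISTANCE = 500  # bases, as stated in the text


@dataclass(frozen=True)
class SV:
    """Hypothetical structural variation record with two breakpoints."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    orientation: str  # assumed encoding, e.g. "+-" or "-+"


def matches(candidate: SV, known: SV, max_distance: int = MAX_DISTANCE) -> bool:
    """Same chromosomes, same orientation, both breakpoints within max_distance (assumed rule)."""
    return (
        candidate.chrom1 == known.chrom1
        and candidate.chrom2 == known.chrom2
        and candidate.orientation == known.orientation
        and abs(candidate.pos1 - known.pos1) <= max_distance
        and abs(candidate.pos2 - known.pos2) <= max_distance
    )


def label_candidates(candidates: List[SV], known_list: List[SV]) -> List[bool]:
    """Return True (positive) or False (negative) for each candidate."""
    return [any(matches(c, k) for k in known_list) for c in candidates]


# Made-up coordinates, for illustration only.
known = [SV("chr1", 10_000, "chr1", 50_000, "-+")]
candidates = [
    SV("chr1", 10_120, "chr1", 49_800, "-+"),  # close, same orientation
    SV("chr1", 10_120, "chr1", 49_800, "+-"),  # close, different orientation
    SV("chr1", 12_000, "chr1", 50_000, "-+"),  # first breakpoint too far away
]
labels = label_candidates(candidates, known)
```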

Using the dataset in which a classification class is mapped to the features of each structural variation candidate, the device 100 trains the machine learning model 200 to classify a structural variation candidate into a classification class from its features (S140). The machine learning model 200 may receive the features of each structural variation candidate and output the probability of classification as positive or negative (a probability value between 0 and 1).
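As a rough sketch of a model that maps candidate features to a probability between 0 and 1, the following pure-Python code trains a small logistic regression by stochastic gradient descent. Logistic regression is one of the model types the disclosure names, but the training loop, the hyperparameters, the two chosen features, and the toy data are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(model: Model, x: Sequence[float]) -> float:
    """Probability that a candidate with features x is a true structural variation."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows: Sequence[Sequence[float]], labels: Sequence[int],
                   lr: float = 0.1, epochs: int = 500) -> Model:
    """Fit weights and bias with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = predict_proba((weights, bias), x) - y  # gradient of log loss w.r.t. z
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Toy rows: (split reads / 10, background noise reads / 10); values are invented.
rows = [(0.4, 0.0), (0.5, 0.1), (0.0, 0.8), (0.1, 0.7)]
labels = [1, 1, 0, 0]
model = train_logistic(rows, labels)
probabilities = [predict_proba(model, x) for x in rows]
```

In practice a library implementation of any of the named model types would replace this loop; the sketch only shows the input/output contract described in the text.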

The learning unit 150 may verify the classification performance of the machine learning model 200 using the validation samples included in the cancer-normal whole genome data 20 and then complete the training (S150). At this time, based on the classification performance verification, the learning unit 150 may determine the cutoff value by which a probability value between 0 and 1 output from the machine learning model 200 is classified as the labeled positive or negative.
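The disclosure does not state which criterion is used to pick the cutoff value. The sketch below assumes one common choice, maximizing the F1 score on the validation samples; the function names, the grid of candidate cutoffs, and the toy probabilities are hypothetical.

```python
from typing import Sequence


def f1_at_cutoff(probs: Sequence[float], labels: Sequence[bool], cutoff: float) -> float:
    """F1 score when candidates with probability >= cutoff are called positive."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and not y)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def choose_cutoff(probs: Sequence[float], labels: Sequence[bool], steps: int = 100) -> float:
    """Scan cutoffs 1/steps .. (steps-1)/steps and keep the lowest one with the best F1."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda c: f1_at_cutoff(probs, labels, c))


# Invented validation outputs and labels.
val_probs = [0.9, 0.8, 0.3, 0.2]
val_labels = [True, True, False, False]
cutoff = choose_cutoff(val_probs, val_labels)
```

A different criterion, such as a target precision, could be substituted without changing the rest of the pipeline.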

FIG. 6 is a diagram schematically illustrating structural variation identification using a trained machine learning model according to an embodiment, and FIG. 7 is a flowchart of a structural variation identification method using the trained machine learning model according to an embodiment.

Referring to FIGS. 6 and 7, the device 100 uses the feature extraction unit 110 and the trained machine learning model 200 to identify structural variations in the whole genome data to be analyzed, and may additionally use the filtering unit 170. The feature extraction unit 110, the filtering unit 170, and the trained machine learning model 200 are implemented as computer programs and may run on any computing device on which they are installed. The annotation unit 130 and the learning unit 150 of FIG. 2 may likewise be implemented as computer programs and run on the computing device on which they are installed. In this case, a user can retrain the trained machine learning model 200 with the user's own genome data.

The feature extraction unit 110 receives, from the structural variation search tool 10, the structural variation candidates detected in the whole genome data 50. The feature extraction unit 110 examines the region surrounding each structural variation candidate and extracts, for each candidate, the features that were used to train the machine learning model 200. The feature extraction unit 110 inputs the features of each structural variation candidate into the trained machine learning model 200.

The trained machine learning model 200 outputs, from the input features, a probability value of classification as positive or negative.

Then, the filtering unit 170 may identify true structural variations by classifying the classification probability value of each structural variation candidate output from the trained machine learning model 200 as negative or positive based on the cutoff value. The filtering unit 170 may then filter out the structural variation candidates classified as negative and generate a list of true structural variations containing the true-positive structural variants (TP SVs) classified as positive.

Referring to FIG. 7, the feature extraction unit 110 receives, from the structural variation search tool 10, the structural variation candidates detected in the whole genome data 50 (S210).

The feature extraction unit 110 extracts the features of each structural variation candidate by examining the region surrounding that candidate (S220). The feature extraction unit 110 may examine the surrounding region of each structural variation candidate through user-defined functions configured to extract the features of Table 1. Here, the range to be examined around a breakpoint may be set differently depending on the type of structural variation or the feature being extracted.
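The claims of this document give two example ranges: at least the average insert size on the side of the breakpoint where the variant-supporting reads lie, and up to 200 bp on the opposite side. A minimal sketch of computing such windows follows. The function name, the half-open coordinate convention, the use of exactly the insert size as the first length, and the example insert size of 350 are assumptions for illustration.

```python
from typing import Tuple

Interval = Tuple[int, int]  # half-open [start, end), assumed convention


def breakpoint_windows(breakpoint: int, reads_on_left: bool, insert_size: int,
                       opposite_bp: int = 200) -> Tuple[Interval, Interval]:
    """Return (support_window, opposite_window) around one breakpoint.

    support_window  spans insert_size bases on the side holding the
                    variant-supporting reads.
    opposite_window spans opposite_bp bases on the other side.
    """
    if reads_on_left:
        support = (max(0, breakpoint - insert_size), breakpoint)
        opposite = (breakpoint, breakpoint + opposite_bp)
    else:
        support = (breakpoint, breakpoint + insert_size)
        opposite = (max(0, breakpoint - opposite_bp), breakpoint)
    return support, opposite


# Hypothetical breakpoint at position 1000 with an average insert size of 350.
left_case = breakpoint_windows(1000, reads_on_left=True, insert_size=350)
right_case = breakpoint_windows(1000, reads_on_left=False, insert_size=350)
```

Passing the window sizes as parameters reflects the statement that the examined range may differ by structural variation type or by feature.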

The feature extraction unit 110 inputs the features of each structural variation candidate into the trained machine learning model 200 (S230).

The trained machine learning model 200 outputs, from the input features, a probability value of classification as positive or negative (S240).

The filtering unit 170 judges the classification probability value of each structural variation candidate output from the trained machine learning model 200 to be negative or positive based on the cutoff value, and thereby identifies each candidate as a negative structural variation or a positive structural variation (S250).

From among the structural variation candidates detected in the whole genome data 50, the filtering unit 170 removes the candidates identified as negative structural variations and generates a list of true structural variations (S260). Because the structural variation candidates detected by the structural variation search tool 10 include false positives, the filtering unit 170 treats the candidates that the machine learning model 200 classified as negative as false positives and removes them.
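Steps S250 and S260 amount to thresholding the model output and dropping the negatives. The sketch below shows this under the assumption that a probability greater than or equal to the cutoff counts as positive; the function name, the candidate identifiers, and the probabilities are made up.

```python
from typing import List, Sequence, Tuple


def filter_candidates(candidates: Sequence[str], probabilities: Sequence[float],
                      cutoff: float) -> Tuple[List[str], List[str]]:
    """Split candidates into (kept positives, removed negatives) by the cutoff."""
    kept: List[str] = []
    removed: List[str] = []
    for candidate, probability in zip(candidates, probabilities):
        if probability >= cutoff:  # assumed tie rule: equal to cutoff is positive
            kept.append(candidate)
        else:
            removed.append(candidate)
    return kept, removed


# Invented candidate identifiers and model outputs.
true_sv_list, false_positives = filter_candidates(
    ["sv1", "sv2", "sv3"], [0.92, 0.10, 0.55], cutoff=0.5
)
```

Returning the removed candidates alongside the kept ones makes it easy to audit what the filter discarded.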

Thus, according to the embodiment, various structural variations in the genome can be accurately detected through the machine learning model, and an accurate list of structural variations can be generated by removing false positives, which can contribute to the advancement of oncology, including cancer diagnosis.

According to the embodiment, when a machine learning model trained on standardized features of structural variation candidates is used, false positives can be quickly removed from the structural variation candidates.

FIG. 8 is a hardware configuration diagram of a computing device according to an embodiment.

Referring to FIG. 8, the device 100 may be implemented as a computing device 400 operated by at least one processor. Alternatively, each of the feature extraction unit 110, the annotation unit 130, the learning unit 150, the filtering unit 170, and the machine learning model 200 may be installed on a computing device 400 operated by at least one processor.

The computing device 400 may include one or more processors 410, a memory 430 into which a computer program executed by the processor 410 is loaded, a storage device 450 that stores the computer program and various data, a communication interface 470, and a bus 490 connecting them. The computing device 400 may further include various other components.

The processor 410 controls the operation of the computing device 400 and may be any of various types of processors that process the instructions included in a computer program. For example, it may include at least one of a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphic Processing Unit), or any type of processor well known in the technical field of the present disclosure.

The memory 430 stores various data, commands, and/or information. The memory 430 may load the corresponding computer program from the storage device 450 so that the instructions written to execute the operations of the present disclosure are processed by the processor 410. The memory 430 may be, for example, a ROM (read only memory) or a RAM (random access memory).

The storage device 450 may store the computer program and various data non-transitorily. The storage device 450 may include a non-volatile memory such as a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), or a flash memory, a hard disk, a removable disk, or any type of computer-readable recording medium well known in the technical field to which the present disclosure pertains.

The communication interface 470 may be a wired/wireless communication module that supports wired/wireless communication.

The bus 490 provides a communication function between the components of the computing device 400.

The computer program includes instructions executed by the processor 410 and is stored in a non-transitory computer readable storage medium; the instructions cause the processor 410 to execute the operations of the present disclosure. The computer program may be downloaded over a network or sold as a product.

The computer program for training may include instructions written to execute the steps of: receiving, from a structural variation search tool, structural variation candidates detected in whole genome data composed of a pair of cancer genome data and normal tissue genome data; extracting the features of each structural variation candidate by examining the region surrounding that candidate in the whole genome data; labeling each structural variation candidate with a classification class, positive or negative, by referring to a list of known structural variations; and training a machine learning model, using a dataset in which a classification class is mapped to the features of each structural variation candidate, to classify a structural variation candidate into a classification class from its features. The computer program for training may further include instructions written to execute the steps of: verifying the classification performance of the machine learning model using the validation samples included in the whole genome data and then completing the training; and determining, based on the verification of the classification performance of the machine learning model, a cutoff value by which the probability value output from the machine learning model is classified as positive or negative.

The computer program for training may store the features to be extracted from each structural variation candidate. The features may include the number of variant-supporting reads, the number of split reads with a supplementary alignment tag, the number of split reads, the mapping quality, the read depth change, the number of reads classified as background noise (background noise reads), and the number of samples among the normal tissue genome data in which the same variation was detected (panel of normal samples). The features may further include the following, extracted from clinical data: the tumor histology, a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication), the proportion of cancer cells in the sample (tumor purity), and the ploidy of the cancer cell genome (tumor ploidy).

Meanwhile, the computer program for training may include a machine learning model that receives the features of each structural variation candidate and outputs a probability value of classification as positive or negative.

The computer program distributed for structural variation identification after training may include a first computer program trained to classify a structural variation candidate into a classification class from its features, and a second computer program. The second computer program may include instructions written to: receive the structural variation candidates detected in the whole genome data to be analyzed; extract the features of each structural variation candidate by examining the region surrounding that candidate in the whole genome data; input the features of each structural variation candidate into the machine learning model to obtain a probability value of classification as positive or negative for that candidate; and judge the classification probability value to be negative or positive based on a set cutoff value, thereby identifying each structural variation candidate as a negative structural variation or a positive structural variation.

The first computer program may be a machine learning model implemented as at least one of a logistic regression model, a random forest model, and a stochastic gradient boosting model.

The second computer program may further include instructions for generating a list of true structural variations by removing, from among the structural variation candidates detected by the structural variation search tool, the candidates identified as negative structural variations. The second computer program stores the features to be extracted from each structural variation candidate, and the features may include the number of variant-supporting reads, the number of split reads with a supplementary alignment tag, the number of split reads, the mapping quality, the read depth change, the number of reads classified as background noise (background noise reads), and the number of samples among the normal tissue genome data in which the same variation was detected (panel of normal samples). The second computer program may further include instructions for additionally extracting features of each structural variation candidate from clinical data, and the additionally extracted features may include the tumor histology, a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication), the proportion of cancer cells in the sample (tumor purity), and the ploidy of the cancer cell genome (tumor ploidy).

The embodiments of the present invention described above are not implemented only through the apparatus and method; they may also be implemented through a program that realizes functions corresponding to the configuration of the embodiments, or through a recording medium on which the program is recorded.

Although the embodiments of the present invention have been described in detail above, the scope of rights of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of rights of the present invention.

Claims

1. A method for identifying genomic structural variations, performed by a computing device, the method comprising:
obtaining structural variation candidates identified from whole genome data composed of a pair of cancer genome data and normal tissue genome data;
extracting features of each structural variation candidate based on data located within a predetermined range from each structural variation candidate in the whole genome data;
labeling each structural variation candidate with classification information based on a list of known structural variations; and
training a machine learning model using a dataset in which the features of each structural variation candidate and the labeled classification information are mapped,
wherein the extracting of the features of each structural variation candidate comprises
extracting, for each structural variation candidate, from clinical data, a tumor histology, a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication), a proportion of cancer cells in the sample (tumor purity), and a ploidy of the cancer cell genome (tumor ploidy), and
wherein the machine learning model receives a structural variation candidate to be identified and outputs a classification of the structural variation candidate to be identified.

2. The method of claim 1,
further comprising, before the extracting of the features of each structural variation candidate, correcting an estimated position of a first structural variation candidate among the structural variation candidates based on supplementary alignment (SA) tag information associated with the first structural variation candidate.

3. The method of claim 2,
wherein the correcting of the estimated position of the first structural variation candidate comprises identifying a first region and a second region on a reference sequence to which the first structural variation candidate is mapped,
the first region and the second region being regions that are not contiguous with each other.

4. The method of claim 2,
wherein the correcting of the estimated position of the first structural variation candidate comprises determining a position of a breakpoint associated with the first structural variation candidate.

5. The method of claim 1,
wherein the extracting of the features of each structural variation candidate based on data located within the predetermined range from each structural variation candidate comprises
acquiring, in the whole genome data, data of at least a first length from a breakpoint associated with a first structural variation candidate among the structural variation candidates, in the direction in which variant-supporting reads of the first structural variation candidate are located.

6. The method of claim 5,
wherein the first length is an average insert size of the whole genome data.

7. The method of claim 6,
wherein the features of the structural variation candidate comprise at least one of:
the number of variant-supporting reads, the number of split reads, the number of split reads with a supplementary alignment tag, and the number of reads in the normal tissue genome data that have the same clipped sequences as the split reads (number of reads having the same clipped sequences with split reads within normal WGS data).

8. The method of claim 1,
wherein the extracting of the features of each structural variation candidate based on data located within the predetermined range from each structural variation candidate comprises
acquiring, in the whole genome data, data within a second length from a breakpoint associated with a first structural variation candidate among the structural variation candidates, in the direction opposite to the variant-supporting reads of the first structural variation candidate.

9. The method of claim 8,
wherein the second length is 200 bp (base pairs).

10. The method of claim 1,
wherein the labeling of the classification information comprises any one of:
labeling a first structural variation candidate among the structural variation candidates as positive, the position of the first structural variation candidate being marked as positive in the list of known structural variations; and
labeling a second structural variation candidate among the structural variation candidates as negative, the position of the second structural variation candidate being marked as negative in the list of known structural variations.

11. The method of claim 1,
wherein the extracting of the features of each structural variation candidate comprises
extracting, for each structural variation candidate, the number of variant-supporting reads, the number of split reads with a supplementary alignment tag, the number of split reads, the mapping quality, the read depth change, the number of reads classified as background noise (background noise reads), and the number of samples among the normal tissue genome data in which the same variation was detected (panel of normal samples).

12. (Deleted)

13. The method of claim 1,
wherein the machine learning model is
a machine learning model that receives the features of the structural variation candidate to be identified and outputs a probability value of classifying the structural variation candidate to be identified as positive or negative.

14. The method of claim 1, further comprising:
verifying the classification performance of the machine learning model using validation samples included in the whole genome data; and
determining, based on the result of verifying the classification performance of the machine learning model, a cutoff value for determining whether a probability value output from the machine learning model indicates positive or negative.

15. The method of claim 1,
wherein the obtaining of the structural variation candidates identified from the whole genome data comprises:
inputting the whole genome data into a structural variation search tool; and
obtaining the structural variation candidates output by the structural variation search tool.

16. A method for identifying genomic structural variations, performed by a computing device, the method comprising:
obtaining structural variation candidates from whole genome data to be analyzed;
extracting features of each structural variation candidate based on data located within a predetermined range from each structural variation candidate in the whole genome data; and
inputting the extracted features of each structural variation candidate into a trained machine learning model, and identifying each structural variation candidate as a negative structural variation or a positive structural variation,
wherein the extracting of the features of each structural variation candidate comprises
extracting, for each structural variation candidate, from clinical data, a tumor histology, a feature indicating whether the whole genome has undergone genome duplication (whole-genome duplication), a proportion of cancer cells in the sample (tumor purity), and a ploidy of the cancer cell genome (tumor ploidy), and
wherein the machine learning model is
a model trained using features of structural variation candidates for training, obtained from whole genome data composed of a pair of cancer genome data and normal tissue genome data, and classification information of the structural variation candidates for training.

17. The method of claim 16,
further comprising generating a list of true structural variations by removing, from among the obtained structural variation candidates, the structural variation candidates identified as negative structural variations.

A non-transitory computer-readable storage medium comprising instructions,
wherein the instructions, when executed by one or more processors of a computing device, cause the computing device to perform operations comprising:
obtaining structural variation candidates identified from whole genome data composed of a pair of cancer genome data and normal tissue genome data;
extracting features of each structural variation candidate based on data located within a predetermined range from that structural variation candidate in the whole genome data;
labeling each structural variation candidate with classification information based on a list of known structural variations; and
training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information,
wherein the step of extracting the features of each structural variation candidate comprises:
extracting, for each structural variation candidate, tumor histology, whole-genome duplication, tumor purity in the sample, and tumor ploidy of the cancer cell genome from clinical data, and
wherein the machine learning model receives a structural variation candidate to be identified and outputs a classification of that structural variation candidate.
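
The labeling and training operations recited above can be sketched as follows. The claim does not specify a matching rule or a model type here, so both are assumptions: candidates are matched to the known list by chromosome pair and a breakpoint tolerance `tol`, and a nearest-centroid classifier is used purely as a dependency-free stand-in for the machine learning model. All names and data values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical candidate key: (chrom1, pos1, chrom2, pos2) breakpoints.
Breakpoints = Tuple[str, int, str, int]


def label_candidate(cand: Breakpoints, known: List[Breakpoints], tol: int = 10) -> int:
    """Return 1 if the candidate matches a known variation within tol bp, else 0."""
    c1, p1, c2, p2 = cand
    for k1, q1, k2, q2 in known:
        if c1 == k1 and c2 == k2 and abs(p1 - q1) <= tol and abs(p2 - q2) <= tol:
            return 1
    return 0


def train_centroids(X: List[Sequence[float]], y: List[int]) -> Dict[int, List[float]]:
    """Stand-in training step: compute the mean feature vector per label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(model: Dict[int, List[float]], row: Sequence[float]) -> int:
    """Return the label whose centroid is closest to the feature vector."""
    def dist2(centroid: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: dist2(model[label]))


known = [("chr1", 1000, "chr2", 5000)]
# Made-up candidates mapped to made-up two-dimensional feature vectors.
extracted = {
    ("chr1", 1003, "chr2", 4998): [12.0, 0.9],
    ("chr3", 200, "chr3", 900): [2.0, 0.2],
    ("chr1", 1000, "chr2", 5000): [10.0, 0.8],
    ("chr5", 50, "chr7", 70): [1.0, 0.1],
}
X = list(extracted.values())
y = [label_candidate(c, known) for c in extracted]  # features mapped to labels
model = train_centroids(X, y)
print(y, classify(model, [9.0, 0.7]), classify(model, [1.2, 0.1]))
```

Any supervised classifier could take the place of the centroid model; the part of the sketch that follows the claim is the pairing of each candidate's features with the label derived from the known list.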

A computing device comprising:
a processor; and
a memory storing instructions,
wherein the instructions, when executed by the processor, cause the computing device to perform operations comprising:
obtaining structural variation candidates identified from whole genome data composed of a pair of cancer genome data and normal tissue genome data;
extracting features of each structural variation candidate based on data located within a predetermined range from that structural variation candidate in the whole genome data;
labeling each structural variation candidate with classification information based on a list of known structural variations; and
training a machine learning model using a dataset in which the features of each structural variation candidate are mapped to the labeled classification information,
wherein the step of extracting the features of each structural variation candidate comprises:
extracting, for each structural variation candidate, tumor histology, whole-genome duplication, tumor purity in the sample, and tumor ploidy of the cancer cell genome from clinical data, and
wherein the machine learning model receives a structural variation candidate to be identified and outputs a classification of that structural variation candidate.