KR102347464B1

KR102347464B1 - A method and apparatus for determining true positive variation in nucleic acid sequencing analysis

Info

Publication number: KR102347464B1
Application number: KR1020200020550A
Authority: KR
Inventors: 신승호; 박동현
Original assignee: 지니너스 주식회사
Priority date: 2020-02-19
Filing date: 2020-02-19
Publication date: 2022-01-06
Also published as: KR20210105725A

Abstract

The present application relates to a method and apparatus for detecting false-positive variants in nucleic acid sequencing analysis. The method and apparatus for determining true-positive variants, provided as one aspect, identify the error distribution, which differs according to the characteristics of the sample, and reflect that information during variant detection, so that true-positive variants can be determined more accurately.

Description

A method and apparatus for determining true positive variation in nucleic acid sequencing analysis

The present application relates to a method and apparatus for detecting true-positive variants in nucleic acid sequencing analysis.

With the rapid development of next-generation sequencing (NGS), research on cancer diagnosis and variant tracking using genomic data is being actively conducted. In particular, with the recent emergence of cancer genome analysis services based on liquid biopsy, methods for detecting variants present at low levels are becoming increasingly important. Not only in liquid biopsy but also in many NGS-based cancer genome analyses, it is important to accurately detect variants that are present at low levels depending on the proportion of cancer cells in the tissue.

For variants present at low levels, an analysis platform appropriate to that level is recommended for detection. However, as is already known, general variant detection with NGS is limited to variants present at about 1-2% or more.

A representative reason why low-level variants are difficult to detect is that they are hard to distinguish from the errors that accompany the generation of sequence information. In particular, variants below 1% are difficult to distinguish from errors arising during DNA processing and during sequence data generation by NGS.

Several methods are known for overcoming this problem. Apart from NGS analysis itself, methods for effectively removing errors that arise during DNA processing have been introduced, and it has recently been reported that errors arising during sequencing can be effectively removed using digital error suppression.

Another strategy for detecting low-level variants by statistical means is to use normal DNA from the same patient (matched-normal gDNA), which is assumed to contain no variants. In cancer genome analysis, gDNA from white blood cells (WBC), which are easy to obtain and access, is widely used as the normal sample for removing germline variants. Accordingly, studies that use WBC to improve the accuracy of variant detection have also been reported. For whole-exome sequencing of tissue, cmDetect was published in 2016; it addresses the problem that, when WBC gDNA from blood is analyzed, the presence of circulating tumor DNA causes somatic variants to be mistaken for germline variants and therefore go undetected.

Another way of using WBC gDNA is filtering based on position-specific error patterns. Because WBC are assumed to carry no somatic variants, all variants observed in WBC other than SNPs are regarded as errors, and a variant found in the sample under analysis is selected as a variant when its level is statistically significantly higher than that error level (for example, the distribution of variant levels obtained from WBC gDNA is compared with the variant level of a detected candidate by Z-standardization or a similar test).
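The Z-standardization comparison mentioned above can be illustrated with a minimal Python sketch. The function name `z_score`, the background values, and the cutoff of 3.0 are hypothetical and chosen only for illustration; the source does not specify them.

```python
from statistics import mean, stdev


def z_score(candidate_vaf, background_vafs):
    """Standardize a candidate's variant level against background error levels."""
    mu = mean(background_vafs)
    sigma = stdev(background_vafs)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background error levels have zero variance")
    return (candidate_vaf - mu) / sigma


# Hypothetical error levels observed at one position in WBC gDNA (not from the source).
wbc_background = [0.0004, 0.0006, 0.0005, 0.0007, 0.0003]

z = z_score(0.004, wbc_background)
is_candidate = z > 3.0  # illustrative cutoff, an assumption
```

A candidate whose level sits well above the WBC background receives a large positive Z value; a candidate at the background mean receives a Z value of zero.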

However, the error distribution differs when DNA is processed differently during preparation for analysis, as with tissue or the cell-free DNA (cfDNA) of a liquid biopsy, or when different reagents and experimental procedures are used. The accuracy of true-positive variant detection based on the error distribution of WBC gDNA can be secured only when such differences are properly corrected for.

In one aspect, a method for determining true-positive variants in nucleic acid sequencing analysis is provided.

In one aspect, a computer-readable recording medium is provided on which a computer program for performing the method for determining true-positive variants is recorded.

In one aspect, an apparatus for determining true-positive variants in nucleic acid sequencing analysis is provided.

Other objects and advantages of the present application will become clearer from the following detailed description together with the appended claims. Matters not described in this specification can be sufficiently recognized and inferred by those skilled in the technical field of the present application or a similar field, and their description is therefore omitted.

In one aspect, there is provided a method for determining true-positive variants, comprising: (a) independently obtaining sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells, for nucleic acids that are derived from the normal cells and the mutant cells, respectively, and that contain no somatic or germline variants; (b) calculating from the data an ω value, which is the ratio of the mean of the sequencing error distribution of the mutant cells to the mean of the sequencing error distribution of the normal cells; (c) applying the calculated ω value as a weight to the sequencing error distribution data of the normal cells to obtain corrected sequencing error distribution data; and (d) evaluating the statistical significance between the corrected sequencing error distribution data and variant probability distribution values obtained from nucleic acid sequencing data of a sample under analysis aligned to a human reference genome sequence.

In one aspect, a computer-readable recording medium is provided on which a computer program for performing the method is recorded.

In one aspect, there is provided an apparatus for determining true-positive variants in nucleic acid sequencing analysis, comprising: a data collection unit that independently obtains sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells, for nucleic acids that are derived from the normal cells and the mutant cells, respectively, and that contain no somatic or germline variants; a data analysis unit that calculates from the data an ω value, which is the ratio of the mean of the sequencing error distribution of the mutant cells to the mean of the sequencing error distribution of the normal cells; a data correction unit that applies the calculated ω value as a weight to the sequencing error distribution data of the normal cells to obtain corrected sequencing error distribution data; and a data application unit that evaluates the statistical significance between the corrected sequencing error distribution data and variant probability distribution values obtained from nucleic acid sequencing data of a sample under analysis aligned to a human reference genome sequence.

The method for determining true-positive variants in nucleic acid sequencing analysis, provided as one aspect, identifies the error distribution, which differs according to the characteristics of the sample, and uses that information during variant detection. Specifically, the method according to one aspect compares the corrected background error distribution with the variant probability distribution obtained from the nucleic acid sequencing data of the sample under analysis, so that true-positive variants can be determined more accurately.

FIG. 1 shows, in chronological order, the steps of a method for determining true-positive variants according to one aspect.
FIG. 2 schematically illustrates methods for determining true-positive variants: FIG. 2A shows a conventional method, and FIG. 2B shows the method according to one aspect.
FIG. 3 shows the distribution of errors for each of the 12 nucleotide substitution types.
FIG. 4 shows an example of the distribution of errors for each of the 12 nucleotide substitution types at specific positions (Bin 1, Bin 2, Bin 3) in the gDNA sequence data of a sample under analysis aligned to the human reference genome.
FIG. 5 schematically illustrates the process of determining a true-positive variant at a specific position in the method according to one aspect; specifically, it shows the comparison of the corrected sequencing error distribution data with the variant probability distribution obtained from the nucleic acid sequencing data of the sample under analysis aligned to the human reference genome sequence. Here, the reads corresponding to ① and ② were used to obtain the corrected sequencing error distribution data, which includes calculating the ω value, and the reads corresponding to ③ were used to obtain the nucleic acid sequencing data of the sample under analysis.
FIG. 6 shows the components of an apparatus for determining true-positive variants according to one aspect.

The terms used in this specification have been chosen, as far as possible, from general terms that are currently in wide use, taking each function into account, but they may vary according to the intention of those working in the field, precedent, the emergence of new technology, and the like. In certain cases a term has been selected arbitrarily, and its meaning is then described in detail in the corresponding part of the description. The terms used in this specification should therefore be defined on the basis of their meaning and the content of the specification as a whole, not simply by their names.

In the descriptions, when a part is said to be connected to another part, this includes not only a direct connection but also an organic connection with another component interposed between them. When a part is said to include a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise. The terms "... unit" and "... module" used in the specification mean a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Terms such as "consists of" or "comprises" used in this specification should not be construed as necessarily including all of the components or steps described in the specification; some of the components or steps may be absent, or additional components or steps may be included.

The descriptions should not be construed as limiting the scope of rights, and anything that a person of ordinary skill in the art can readily infer should be construed as falling within that scope.

In one aspect, there is provided, in a computer-based system, a method for determining true-positive variants, comprising:

(a) independently obtaining sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells, for nucleic acids that are derived from the normal cells and the mutant cells, respectively, and that contain no somatic or germline variants;

(b) calculating from the data an ω value, which is the ratio of the mean of the sequencing error distribution of the mutant cells to the mean of the sequencing error distribution of the normal cells;

(c) applying the calculated ω value as a weight to the sequencing error distribution data of the normal cells to obtain corrected sequencing error distribution data; and

(d) evaluating the statistical significance between the corrected sequencing error distribution data and variant probability distribution values obtained from nucleic acid sequencing data of a sample under analysis aligned to a human reference genome sequence.

As used herein, the term "nucleic acid sequencing analysis" may refer to next-generation sequencing (NGS). Nucleic acid sequencing may be used interchangeably with base sequencing, sequence analysis, or sequencing. NGS may be used interchangeably with massively parallel sequencing or second-generation sequencing. NGS is a technique for sequencing a large number of nucleic acid fragments simultaneously; it may fragment the whole genome in a chip-based and polymerase chain reaction (PCR)-based paired-end format and sequence the fragments at very high speed on the basis of hybridization. NGS may be performed by, for example, the 454 platform (Roche), GS FLX Titanium, Illumina MiSeq, Illumina HiSeq, Illumina HiSeq 2500, Illumina Genome Analyzer, the Solexa platform, the SOLiD System (Applied Biosystems), Ion Proton (Life Technologies), Complete Genomics, Helicos Biosciences Heliscope, the single-molecule real-time (SMRT™) technology of Pacific Biosciences, or a combination thereof. The nucleic acid sequencing may be a method for analyzing only a region of interest, and may include, for example, NGS-based targeted sequencing, targeted deep sequencing, or panel sequencing.

FIG. 1 shows the overall flow of the method for determining true-positive variants. Referring to FIG. 1, the method may include obtaining sequencing error distribution data (110), calculating the ratio of the mean sequencing error distribution of mutant cells to that of normal cells (120), obtaining corrected sequencing error distribution data (130), and evaluating the statistical significance between the corrected sequencing error distribution data and the nucleic acid sequencing data of the sample under analysis (140).

In the step of obtaining sequencing error distribution data (110), sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells are obtained independently, for nucleic acids that are derived from the normal cells and the mutant cells, respectively, and that contain no somatic or germline variants.

In this step, the nucleic acid may be a genome or a fragment thereof. As used herein, the term "genome" refers collectively to the whole of the chromosomes, chromatin, or genes. The genome or fragment thereof may be isolated DNA, for example cell-free DNA (cfDNA). Nucleic acid may be extracted or isolated from the cells by methods known to those skilled in the art. Here, fragmentation means cutting the genome physically, chemically, or enzymatically, and this process may produce reads of various lengths. As used herein, the term "read" refers to the sequence information of one or more nucleic acid fragments generated in nucleic acid sequencing analysis; a read may be about 10 bp to about 2000 bp, for example about 15 bp to about 1500 bp, about 20 bp to about 1000 bp, about 20 bp to about 500 bp, about 20 bp to about 200 bp, or about 20 bp to about 100 bp, but is not limited thereto.

The step of obtaining the sequencing error distribution data may be performed by methods known to those skilled in the art, for example according to the method described in Nucleic Acids Research 41.7 (2013): e89-e89, but is not limited thereto.

In this step, the normal cells may be cells containing nucleic acid in which no somatic gene variants are present. These cells serve to secure background error data for nucleic acid sequencing analysis, and their type is not particularly limited as long as the nucleic acid in the cells has no somatic or germline gene variants. Specifically, the normal cells include all types generally classified as matched normal in biological research. For example, non-cancerous normal cells without visible lesions, located around the tissue in which a lesion is present, may be used; in place of the normal tissue of the cancer, leukocyte-lineage cells of the blood, which are commonly used as the matched normal, may also be used.

In this step, the mutant cells may be cells containing nucleic acid in which somatic gene variants are present; this may be intended to reflect that the error distribution in nucleic acid sequencing analysis differs between normal cells and each kind of mutant cell. Specifically, the mutant cells may include all cells containing nucleic acid with variants that affect a disease. This may be intended to implement a technique for statistically evaluating a variant when variants in disease-associated nucleic acid are assessed. For example, the mutant cells may be cells containing cell-free DNA (cfDNA) that includes the nucleic acid of mutant cells, but they are not particularly limited as long as they are cells containing somatic variants.

This step obtains error distribution data for nucleic acid sequences that contain no somatic or germline variants, in order to obtain, for each cell type, the background error distribution that accompanies nucleic acid sequencing analysis. It may include obtaining sequencing error distribution data for the nucleic acid of cells from which somatic and germline variants have been removed. Here, "removal" may include replacing sequence data that do not match the human reference genome sequence, in the nucleic acid sequencing data of mutant cells aligned to that sequence, with the data of the human reference genome sequence.
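The "removal" described above might be sketched in Python as follows. This is only an illustration under stated assumptions: the pileup is modeled as a hypothetical dictionary from position to observed bases, the set of variant positions is assumed to be known in advance, and the names `mask_known_variants` and `error_rate` do not come from the source.

```python
def mask_known_variants(pileup, reference, variant_positions):
    """Replace observed bases at variant positions with the reference base."""
    masked = {}
    for pos, bases in pileup.items():
        if pos in variant_positions:
            masked[pos] = [reference[pos]] * len(bases)
        else:
            masked[pos] = list(bases)
    return masked


def error_rate(pileup, reference):
    """Fraction of observed bases that differ from the reference base."""
    total = sum(len(bases) for bases in pileup.values())
    mismatches = sum(
        1 for pos, bases in pileup.items() for base in bases if base != reference[pos]
    )
    return mismatches / total


# Hypothetical toy data: position 101 carries a heterozygous germline variant.
reference = {100: "A", 101: "C", 102: "G"}
pileup = {
    100: list("AAAAAAAAAG"),
    101: list("CCCCCTTTTT"),
    102: list("GGGGGGGGGG"),
}

raw_rate = error_rate(pileup, reference)
background_rate = error_rate(mask_known_variants(pileup, reference, {101}), reference)
```

After masking, only the single mismatch at position 100 remains, so the background rate reflects sequencing error and not the germline variant.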

In the step of calculating the ratio of the mean sequencing error distribution of mutant cells to that of normal cells (120), the ω value, which is the ratio of the mean of the sequencing error distribution of the mutant cells to the mean of the sequencing error distribution of the normal cells, is calculated from the data of step (a).

This step calculates the ω value, the ratio of the mean of the error distribution of the mutant cells to the mean of the error distribution of the normal cells, from the sequencing error distribution data obtained in step (a) for the nucleic acid sequences of each cell type. The mean of an error distribution may mean the mean probability that an error occurs, and more specifically the mean error probability obtained from the distribution of error probabilities.

This step may include calculating an ω value for each of the 12 nucleotide substitution types, namely A>G, A>T, A>C, G>A, G>T, G>C, T>A, T>G, T>C, C>A, C>G, and C>T. According to one embodiment, as shown in FIG. 3, the ratio of the mean error probabilities obtained from the distributions of error probabilities, that is, the ω value, can be calculated for each of the 12 substitution types.
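The per-substitution calculation can be sketched as below. The input layout (a dictionary mapping each substitution type to a list of error probabilities), the function name `omega_by_substitution`, and the example values are assumptions made for illustration only.

```python
from statistics import mean

# The 12 nucleotide substitution types, written as REF>ALT.
SUBSTITUTIONS = [f"{ref}>{alt}" for ref in "AGTC" for alt in "AGTC" if ref != alt]


def omega_by_substitution(normal_rates, mutant_rates):
    """omega = mean(mutant error rates) / mean(normal error rates), per type."""
    omega = {}
    for sub in SUBSTITUTIONS:
        normal_mean = mean(normal_rates[sub])
        if normal_mean == 0:
            continue  # omega is undefined without a normal-cell error signal
        omega[sub] = mean(mutant_rates[sub]) / normal_mean
    return omega


# Hypothetical error probabilities, identical for every type to keep the example short.
normal_rates = {sub: [0.001, 0.003] for sub in SUBSTITUTIONS}
mutant_rates = {sub: [0.002, 0.006] for sub in SUBSTITUTIONS}

omega = omega_by_substitution(normal_rates, mutant_rates)
```

In this toy input the mutant-cell errors are twice as frequent as the normal-cell errors, so every ω value is approximately 2.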

This step may include calculating an ω value for each of the 12 nucleotide substitution types at a specific position in the error distribution data. According to one embodiment, as shown in FIG. 4, in sequencing data in which the nucleotide substitutions are aligned to the reference genome, the ratio of the mean error probabilities obtained from the distributions of error probabilities, that is, the ω value, can be calculated for each of the 12 substitution types at a specific position, specifically at any position expressed relative to the reference genome.

The reference genome may be obtained from a database (DB) already known in the art, such as NCBI (National Center for Biotechnology Information), GEO (Gene Expression Omnibus), FDA (Food and Drug Administration), My Cancer Genome, or KFDA (Ministry of Food and Drug Safety). That is, the reference genome may be obtained from public genome data or public HapMap data.

This step may include grouping the reads derived from normal cells or mutant cells that were used to obtain the sequencing error distribution data of step (a) according to position information; and calculating an ω value for each nucleotide substitution type from the sequencing error distribution data of the reads, derived from normal cells or mutant cells, that share the same position information. In this case, the position within the read at which each of the 12 nucleotide substitution types occurs is taken into account, so a more accurate ω value can be calculated.

In the grouping step, the base-pair length unit of a read group may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16 bp, but is not necessarily limited thereto, and a person skilled in the art carrying out the method may choose it appropriately when dividing the reads into a plurality of groups.
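One way to picture the position-based grouping is the sketch below. It assumes that the position information is an integer offset, that a group is formed by integer division of that offset by the chosen length unit, and that observations arrive as hypothetical `(position, substitution, error_rate)` tuples; none of these details is fixed by the source.

```python
from collections import defaultdict
from statistics import mean


def group_rates(observations, bin_size):
    """Group error rates by (position bin, substitution type)."""
    groups = defaultdict(list)
    for position, substitution, rate in observations:
        groups[(position // bin_size, substitution)].append(rate)
    return groups


def omega_by_bin(normal_obs, mutant_obs, bin_size=4):
    """Compute omega for every group present in both data sets."""
    normal = group_rates(normal_obs, bin_size)
    mutant = group_rates(mutant_obs, bin_size)
    omega = {}
    for key in normal.keys() & mutant.keys():
        normal_mean = mean(normal[key])
        if normal_mean > 0:
            omega[key] = mean(mutant[key]) / normal_mean
    return omega


# Hypothetical observations: (position, substitution type, error rate).
normal_obs = [(0, "C>T", 0.001), (3, "C>T", 0.003), (5, "C>T", 0.002)]
mutant_obs = [(1, "C>T", 0.004), (2, "C>T", 0.004), (6, "C>T", 0.002)]

omega = omega_by_bin(normal_obs, mutant_obs, bin_size=4)
```

With a 4 bp unit, positions 0 to 3 fall into bin 0 and positions 4 to 7 into bin 1, so the two bins receive separate ω values (about 2 and about 1 in this toy input).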

The ω value is the ratio of the mean of the error distribution of the mutant cells to the mean of the error distribution of the normal cells, calculated from the data obtained in step (a), and is expressed by the following equation:

[Equation 1]

ω = (mean of the error distribution of the mutant cells) / (mean of the error distribution of the normal cells)

In the step of obtaining corrected sequencing error distribution data (130), the calculated ω value is applied as a weight to the sequencing error distribution data of the normal cells to obtain the corrected sequencing error distribution data.

This step may include multiplying the sequencing error distribution data of the normal cells by the ω value calculated in step (b) to obtain the corrected sequencing error distribution data. When the reads have been grouped according to position information and an ω value has been calculated for each position group, the step may include multiplying the grouped sequencing error distribution data of the normal cells by the corresponding ω value to obtain the corrected sequencing error distribution data.
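The weighting by multiplication might look like the following sketch. The grouping keys, the function name `apply_omega`, and the fallback weight of 1.0 for groups without an ω value are assumptions introduced for illustration.

```python
def apply_omega(normal_rates_by_group, omega_by_group, default_omega=1.0):
    """Multiply each group's normal-cell error rates by that group's omega."""
    corrected = {}
    for group, rates in normal_rates_by_group.items():
        weight = omega_by_group.get(group, default_omega)  # fallback is an assumption
        corrected[group] = [rate * weight for rate in rates]
    return corrected


# Hypothetical normal-cell error rates and omega values.
normal_rates = {"C>T": [0.001, 0.002], "A>G": [0.0005]}
omega = {"C>T": 1.5}

corrected = apply_omega(normal_rates, omega)
```

Because every rate in a group is scaled by the same factor, the mean of the corrected distribution for that group equals ω times the mean of the normal-cell distribution.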

In this step, the ω value may be applied as a weight in order to reflect that nucleic acid from normal cells and nucleic acid from mutant cells are mixed in the sample under analysis. Specifically, even a call that does not match the human reference genome sequence but is not determined to be positive because its variant allele frequency (VAF) is not significant can be determined to be a true-positive variant if it is statistically significant once the ω value reflecting the mixing ratio is applied as a weight. As the ω value is applied as a weight, the variant probability distribution obtained from the nucleic acid sequencing data of the sample under analysis may be shifted, and statistical significance can then be evaluated by a statistical significance evaluation method well known in the art to determine whether a true-positive variant is present.

The step (140) of evaluating the statistical significance between the corrected sequencing error distribution data and the nucleic acid sequence analysis data of the sample to be analyzed evaluates the statistical significance between the corrected sequencing error distribution data and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample aligned to the human reference genome sequence.

In this step, the mutation obtained from the nucleic acid sequence analysis data of the sample to be analyzed may be a nucleotide substitution mutation.

This step may include classifying the nucleic acid sequence analysis data of the sample to be analyzed according to position information; and evaluating the statistical significance between 1) the mutation probability distribution values obtained from the classified nucleic acid sequence analysis data and 2) the sequencing error distribution data corrected to reflect the position information of the reads. Here, the classified nucleic acid sequence analysis data and the corrected sequencing error distribution data may be obtained from reads that share the same position information.

FIG. 5 schematically illustrates the process of determining a true positive mutation at a specific position in the method according to one aspect. Referring to FIG. 5, the steps described above, including calculating the ω value from the reads and sequencing error distribution data corresponding to ① and ②, are performed to obtain corrected sequencing error distribution data; a mutation probability distribution is obtained independently from the reads and nucleic acid sequence analysis data corresponding to ③; and the statistical significance between the two distributions may then be evaluated.

In this step, the mutation probability distribution is not particularly limited as long as it represents the probability that a mutation occurs in the sequence of a gene, but it may preferably be a nucleotide substitution mutation probability distribution.

The method may further include determining a mutation to be a true positive when there is a statistically significant difference between the corrected sequencing error distribution data and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample aligned to the human reference genome sequence (see FIG. 2B, true positive mutation). The method may also further include determining a mutation to be a false positive when there is no statistically significant difference between the corrected sequencing error distribution data and those mutation probability distribution values (see FIG. 2B, false positive mutation).

As an example, the statistical significance may be determined according to Equation 2 below:

[Equation 2]

In Equation 2 above,

P(X ≥ 1/ω * VAF_variation) is a probability model that compares the error distributions of a typical normal sample and a mutant sample and reflects the ratio of the mean values of the error distributions calculated from the errors occurring for each of the 12 nucleotide substitution types; it is the model used to determine whether VAF_variation is a true positive or a false positive.

Z denotes the standardized distribution (Z distribution) used to compare the error distribution X of the normal sample with the VAF.

Error(Variation)_jk is the mean of the error distribution of the mutant sample, where j is the reference nucleotide and k is the nucleotide observed in the sample; j and k are each one of A, T, C, and G, and j and k differ from each other.

Error(normal)_jk is the mean of the error distribution of the normal sample, where j is the reference nucleotide and k is the nucleotide observed in the sample; j and k are each one of A, T, C, and G, and j and k differ from each other.

VAF_variation, or [expression not reproduced in the source text], means that the distribution of the mutant sample is reflected on the error distribution of the normal sample.
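The following is a minimal sketch of how a tail probability of the form P(X ≥ 1/ω * VAF_variation) could be evaluated. It assumes, purely for illustration, that the normal-sample error distribution X is approximated by a normal distribution fitted to observed background error rates; the function name `tail_probability` and the sample values are hypothetical, and the source does not state that this is the exact form of Equation 2.

```python
from statistics import NormalDist, mean, stdev

def tail_probability(normal_error_rates, vaf, omega):
    """Return P(X >= vaf / omega), with X approximated as a normal distribution.

    The normal approximation is an assumption of this sketch.
    """
    background = NormalDist(mean(normal_error_rates), stdev(normal_error_rates))
    threshold = vaf / omega  # corresponds to (1/omega) * VAF_variation
    return 1.0 - background.cdf(threshold)

# Hypothetical background error rates measured in normal-cell data.
rates = [0.001, 0.002, 0.003, 0.002, 0.002]

p_unweighted = tail_probability(rates, vaf=0.01, omega=1.0)
p_weighted = tail_probability(rates, vaf=0.01, omega=2.0)
```

With ω greater than 1 the threshold moves closer to the background mean, so the tail probability grows and the same observed VAF is judged less extreme relative to the background.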

According to one embodiment, the method identifies the error distribution, which differs according to the characteristics of the mutant sample, specifically the position and substitution type of the mutation, and applies that information to mutation determination. As a result, true positive and false positive mutations can be discriminated more accurately among the mutations detected from nucleic acid sequence analysis data, including mutations in regions where statistical error can arise.

In the method, the sequencing error distribution data for the nucleic acid in step (a) or the nucleic acid sequence analysis data of the sample to be analyzed in step (d) may be data from next generation sequencing, targeted sequencing, targeted deep sequencing, or panel sequencing, and more specifically may be data from next generation sequencing.

In one aspect, a computer-readable recording medium is provided on which a computer program for performing the method for determining a true positive mutation is recorded.

The method may be implemented as software readable by various computer means and recorded on a computer-readable recording medium. The recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the method described above, or may be known and available to those skilled in the art of computer software.

For example, the recording medium includes magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM (Random Access Memory), and flash memory. Program instructions may include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the method described above, and vice versa.

Although this specification and the drawings describe exemplary device configurations, implementations of the functional operations and subject matter described herein may be realized in other types of digital electronic circuitry, in computer software, firmware, or hardware including the structures disclosed herein and their structural equivalents, or in a combination of one or more of these. Implementations of the subject matter described herein may be realized as one or more computer program products, that is, one or more modules of computer program instructions encoded on a tangible program storage medium for execution by, or for controlling the operation of, a device according to the method. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter affecting a machine-readable propagated signal, or a combination of one or more of these.

A computer program (also known as a program, software, software application, script, or code) that is installed on a device according to the method and performs the method may be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program may be stored in a single file dedicated to the program in question, in multiple coordinated files (for example, files that store one or more modules, subprograms, or portions of code), or in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document). A computer program may be deployed to run on one computer or on multiple computers that are located at one site or distributed across several sites and interconnected by a communication network.

In one aspect, a device (300) for determining a true positive mutation in nucleic acid sequence analysis is provided, comprising: a data collection unit (310) that independently obtains sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells for nucleic acids that are derived from normal cells or mutant cells, respectively, and that contain no somatic or germline mutations; a data analysis unit (320) that calculates from the data an ω value, which is the ratio of the mean of the sequencing error distribution of mutant cells to the mean of the sequencing error distribution of normal cells; a data correction unit (330) that assigns the calculated ω value as a weight to the sequencing error distribution data of normal cells to obtain corrected sequencing error distribution data; and a data application unit (340) that evaluates the statistical significance between the corrected sequencing error distribution data and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample to be analyzed aligned to the human reference genome sequence.

FIG. 6 shows the components of the device for determining a true positive mutation in nucleic acid sequence analysis. The device implements the method for determining a true positive mutation described above and encompasses a computer-readable medium or a system including one. In addition to the components shown in FIG. 6, other general-purpose components may also be included.

The data collection unit may perform the step of independently obtaining sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells for nucleic acids that are derived from normal cells or mutant cells, respectively, and that contain no somatic or germline mutations. The normal cells may carry nucleic acid in which no somatic or germline mutations are present, whereas the mutant cells may carry nucleic acid in which somatic and germline mutations are present. In that case, the data collection unit may obtain sequencing error distribution data for nucleic acid derived from cells from which the somatic and germline mutations have been removed. Here, 'removal' may include replacing sequence data that does not match the reference, in the nucleic acid sequence analysis data of the mutant cells aligned to the human reference genome sequence, with the data of the human reference genome sequence.
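The 'removal' described above can be pictured with a small sketch. It assumes that the positions of known somatic and germline variants are already available from a prior analysis and that a read is already aligned base-for-base to the reference; the function name `mask_known_variants` and the toy sequences are hypothetical.

```python
def mask_known_variants(read_bases, reference_bases, variant_positions):
    """Return the read with known variant positions reset to the reference base.

    Mismatches at other positions are left untouched so that they can still
    be counted as candidate sequencing errors.
    """
    return "".join(
        ref if index in variant_positions else base
        for index, (base, ref) in enumerate(zip(read_bases, reference_bases))
    )

# Toy example: index 2 is a known variant, index 4 is an unexplained mismatch.
masked = mask_known_variants("ACGTT", "ACCTA", variant_positions={2})
```

After masking, only the mismatch at index 4 remains, so it contributes to the error distribution while the known variant does not.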

The data analysis unit may perform the step of calculating, from the data obtained by the data collection unit, the ω value, which is the ratio of the mean of the sequencing error distribution of mutant cells to the mean of the sequencing error distribution of normal cells. To obtain an accurate ratio of the error distribution means, an ω value may preferably be calculated for each of the 12 nucleotide substitution types; to obtain a still more accurate ratio, an ω value may preferably be calculated for each of the 12 nucleotide substitution types at a specific position in the error distribution data.
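A minimal sketch of the per-substitution ω calculation follows. It assumes error rates are stored in dictionaries keyed by a substitution label such as "C>T"; the label format, the function name `omega_by_substitution`, and the numeric values are hypothetical.

```python
from itertools import permutations
from statistics import mean

BASES = "ATCG"
# The 12 ordered (reference, observed) pairs with reference != observed.
SUBSTITUTIONS = [f"{ref}>{alt}" for ref, alt in permutations(BASES, 2)]

def omega_by_substitution(normal_rates, variant_rates):
    """Ratio of mean mutant-sample error rate to mean normal-sample error rate.

    Substitution types with a zero normal-sample mean are skipped, because
    the ratio is undefined there (a choice made for this sketch).
    """
    omega = {}
    for substitution in SUBSTITUTIONS:
        if substitution not in normal_rates or substitution not in variant_rates:
            continue
        normal_mean = mean(normal_rates[substitution])
        if normal_mean == 0:
            continue
        omega[substitution] = mean(variant_rates[substitution]) / normal_mean
    return omega

# Hypothetical error rates from several normal and mutant samples.
omega = omega_by_substitution(
    {"C>T": [0.001, 0.003], "A>G": [0.002, 0.002]},
    {"C>T": [0.004, 0.004], "A>G": [0.001, 0.001]},
)
```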

More specifically, the data analysis unit may perform the steps of grouping the reads derived from normal cells or mutant cells, which the data collection unit used to obtain the sequencing error distribution data, according to position information; and calculating an ω value for each nucleotide substitution type from the sequencing error distribution data for reads, derived from normal cells or mutant cells, that share the same position information. In the grouping, the group length may be 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, or 16 bp, but is not necessarily limited to these values, and a person skilled in the art implementing the device may select it appropriately when dividing reads into multiple groups.
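The position-based grouping can be sketched as follows, assuming 1-based positions within a read and a fixed group length. The function names and the observation tuples are hypothetical, and a 5 bp group length is used only as one of the permitted values.

```python
def position_bin(position, bin_size=5):
    """Map a 1-based position within a read to a 1-based bin index."""
    return (position - 1) // bin_size + 1

def group_by_bin(observations, bin_size=5):
    """Group error rates by (bin index, substitution type).

    Each observation is a (position, substitution, error_rate) tuple.
    """
    groups = {}
    for position, substitution, rate in observations:
        key = (position_bin(position, bin_size), substitution)
        groups.setdefault(key, []).append(rate)
    return groups

# Hypothetical observations at read positions 2, 4, and 7.
groups = group_by_bin([(2, "C>T", 0.004), (4, "C>T", 0.002), (7, "C>T", 0.001)])
```

An ω value can then be computed separately for each key, so that positions near the read end get their own weight.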

The data correction unit may perform the step of assigning the ω value calculated by the data analysis unit as a weight to the sequencing error distribution data of normal cells to obtain corrected sequencing error distribution data. The corrected sequencing error distribution data may be obtained by multiplying the sequencing error distribution data of normal cells by the ω value calculated by the data analysis unit; as an example, the ω value may be calculated by Equation 1 above.

The data correction unit may perform the step of obtaining corrected sequencing error distribution data by multiplying the grouped sequencing error distribution data of normal cells by the ω value calculated for the corresponding group.

The data application unit may perform the step of evaluating the statistical significance between the sequencing error distribution data corrected by the data correction unit and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample to be analyzed aligned to the human reference genome sequence. The mutation obtained from the nucleic acid sequence analysis data of the sample may be a nucleotide substitution mutation.

The data application unit may perform the steps of classifying the nucleic acid sequence analysis data of the sample to be analyzed according to position information; and evaluating the statistical significance between the mutation probability distribution values obtained from the classified nucleic acid sequence analysis data and the corrected sequencing error distribution data obtained from the data correction unit. The classified nucleic acid sequence analysis data and the corrected sequencing error distribution data may be obtained from reads that share the same position information.

The device may further include a data determination unit that determines a mutation to be a true positive when the difference between the sequencing error distribution data corrected by the data correction unit and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample to be analyzed aligned to the human reference genome sequence is statistically significant.

The mutation probability distribution is not particularly limited as long as it represents the probability that a mutation occurs in the sequence of a gene, but it may preferably be a nucleotide substitution mutation probability distribution.

The data determination unit may perform the step of determining a mutation to be a true positive when there is a statistically significant difference between the corrected sequencing error distribution data and the mutation probability distribution values obtained from the nucleic acid sequence analysis data of the sample to be analyzed aligned to the human reference genome sequence, or determining it to be a false positive when there is no statistically significant difference between them. As an example, the statistical significance may be determined according to Equation 2 above.

In the device, the sequencing error distribution data for the nucleic acid handled by the data collection unit or the nucleic acid sequence analysis data of the sample to be analyzed handled by the data application unit may be data from next generation sequencing, targeted sequencing, targeted deep sequencing, or panel sequencing, and more specifically may be data from next generation sequencing.

It will be apparent to those skilled in the art that, for the device, concepts that correspond to or are encompassed by the detailed description of the method for determining a true positive mutation given above can be interpreted with reference to that description.

The invention is described in more detail below through examples. These examples are intended to illustrate each aspect of the invention, and the scope of the invention is not limited to them.

Example 1. Data assumptions

Mutation analysis was performed using cfDNA from cancer patients as an example of mutant cells, with the aim of distinguishing whether each mutation was a true positive or a false positive. To remove variants other than somatic mutations, gDNA from white blood cells (WBC) obtained from blood was used to remove germline data. In addition, WBC gDNA, which has no somatic mutations, was used to measure the frequency of errors that arise randomly during sequencing even when no mutation is present, thereby obtaining an error distribution.

Example 2. Acquisition of weighted error distribution data

Under the above assumptions, the cfDNA of a cancer patient is released from cells, circulates in the blood, and is fragmented to a fairly uniform size (about 180 bp) as a result of its biological characteristics, whereas the gDNA was fragmented artificially for sequencing analysis. Because this difference is known to bias the errors, the error rate for each type of base substitution was calculated from multiple sequencing data sets obtained from cfDNA and gDNA in order to determine the ratio of that bias. The information obtained in this way is the difference in error rate arising from environmental differences, including the experimental differences between cfDNA and gDNA; for this purpose the cfDNA must come from a healthy individual without somatic mutations or from a sample whose somatic mutations have been removed by prior analysis. This procedure yields the mean error rate for each of the 12 base substitutions, as shown in FIG. 3. Let ω_CT denote the ratio of the mean error rates of cfDNA and gDNA for the C>T substitution. Because the error rate may also vary with the position within the sequencing read, as shown in FIG. 4, ω_CT at the K-th position of the read can be written as ω_CTK and used as a weight for modifying the probability model. Collecting such error distribution data makes it possible to correct the error rate when comparing samples with different biases.

Specifically, as shown in FIG. 2B, weighted error distribution data were obtained by taking the error distribution for normal gDNA as the background and shifting it to the right when ω≥1 and to the left when ω≤1.

Example 3. Evaluation of statistical significance for the weighted mutation probability and determination of true positive mutations

Before weighting, P(X≥VAF_variation) represents the probability that the observed variant frequency would be observed in the probability distribution X obtained from normal cells. If the observed VAF belongs to a C>T mutation, the error rate is higher in the cfDNA distribution than in gDNA, as shown in FIG. 3, so judging the observed mutation after standardization without accounting for this can exaggerate the validity of the candidate mutation. For this reason ω_CT can be applied to the probability distribution X, and the statistical test for the mutation can be carried out with P(X * ω_CT≥VAF_variation). The fact that the error rate differs at each position of the sequencing read, as shown in FIG. 4, can also be taken into account. In that case, a data subset K is generated for each position of the sequencing read, and P(X * ω_CTK≥VAF_variation) corresponding to the mutation to be evaluated can be used.
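The weighted probability written here as P(X * ω_CT≥VAF_variation) and the form P(X≥1/ω * VAF_variation) used in the definitions for Equation 2 describe the same event, provided ω_CT is positive. Positivity is an assumption of this note; it holds whenever both mean error rates in the ratio are greater than zero.

```latex
\[
P\left(X \cdot \omega_{CT} \ge \mathrm{VAF}_{variation}\right)
  = P\left(X \ge \frac{1}{\omega_{CT}}\,\mathrm{VAF}_{variation}\right),
  \qquad \omega_{CT} > 0 .
\]
```

Dividing both sides of the inequality inside the probability by a positive constant does not change the set of outcomes that satisfy it, so either form can be used for the test.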

Specifically, when the position-dependent error rate is applied, positions may also be handled as ranges such as 1 to 5 bp. It was assumed that a total of 10 reads contain a given C>T mutation to be evaluated, that the mutation appears at the 1st through 10th positions of the respective reads, and that position-dependent errors are grouped in 5 bp units from the end of the read. Accordingly, for example as shown in FIG. 5, P(X≥1/ω_CT(1~5)*VAF_variation) is used for Bin 1 and P(X≥1/ω_CT(6~10)*VAF_variation) for Bin 2. That is, when the observed mutation differs with statistical significance from the error distribution data weighted by base substitution type and position, it can be determined to be a true positive mutation.
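The scenario above can be worked through in a short sketch: ten reads carry the mutation at positions 1 through 10, positions are grouped in 5 bp bins, and each bin gets its own threshold (1/ω) * VAF_variation. The ω values and the VAF are hypothetical numbers chosen only to show the arithmetic.

```python
from collections import Counter

BIN_SIZE = 5

def bin_label(position):
    """Label the 5 bp bin containing a 1-based read position, e.g. '1~5'."""
    start = (position - 1) // BIN_SIZE * BIN_SIZE + 1
    return f"{start}~{start + BIN_SIZE - 1}"

# The mutation appears at positions 1..10, one position per read.
variant_positions = range(1, 11)
reads_per_bin = Counter(bin_label(p) for p in variant_positions)

# Hypothetical position-specific weights for the C>T substitution.
omega_ct = {"1~5": 2.0, "6~10": 1.25}
vaf_variation = 0.01

# Per-bin threshold used in P(X >= (1/omega) * VAF_variation).
thresholds = {label: vaf_variation / omega_ct[label] for label in reads_per_bin}
```

Each bin's reads are then tested against the background distribution of the same bin using that bin's threshold.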

The foregoing description of the present invention is given for illustration, and those of ordinary skill in the art to which the invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

110: calculating sequencing error distribution data
120: calculating the ratio of the mean sequencing error distribution of mutant cells to that of normal cells
130: obtaining corrected sequencing error distribution data
140: evaluating the statistical significance between the corrected sequencing error distribution data and the nucleic acid sequence analysis data of the sample to be analyzed
300: device for determining true positive mutations
310: data collection unit
320: data analysis unit
330: data application unit
340: data determination unit

Claims

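A method for determining a true positive mutation in nucleic acid sequence analysis, performed in a computer-based system, the method comprising: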
(a) independently obtaining sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells for nucleic acids derived from normal cells or mutant cells that do not contain somatic and germline mutations;
(b) calculating an ω value, which is a ratio of the average sequencing error distribution of mutant cells to the average sequencing error distribution of normal cells, from the data;
(c) assigning the calculated ω value to the sequencing error distribution data of normal cells as a weight to obtain corrected sequencing error distribution data; and
(d) evaluating the statistical significance between the modified sequencing error distribution data and the mutation probability distribution value obtained from the nucleic acid sequencing data in the analysis target sample aligned with the human reference genome sequence,
In step (c), corrected sequencing error distribution data is obtained by multiplying the sequencing error distribution data of normal cells by the ω value calculated in step (b).
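Steps (b) and (c) can be illustrated with a short sketch. It assumes that each error distribution is represented as a plain list of per-site error rates; that representation, the function names, and the numbers are hypothetical and serve only to show the ratio and the multiplication.

```python
from statistics import mean

def omega(normal_errors, mutant_errors):
    """Step (b): ratio of the mean mutant-cell error rate to the mean normal-cell error rate."""
    return mean(mutant_errors) / mean(normal_errors)

def corrected_distribution(normal_errors, w):
    """Step (c): weight the normal-cell error distribution by multiplying each value by omega."""
    return [e * w for e in normal_errors]

# Hypothetical per-site error rates (illustrative values only).
normal = [0.0010, 0.0012, 0.0008]
mutant = [0.0020, 0.0024, 0.0016]

w = omega(normal, mutant)
corrected = corrected_distribution(normal, w)
```

Because every value is multiplied by the same factor, an ω greater than 1 moves the whole normal-cell distribution toward higher error rates and an ω less than 1 moves it toward lower ones.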

The method according to claim 1, wherein, when the ω value calculated in step (b) is greater than 1, the sequencing error distribution of normal cells is shifted to the right, and when the ω value calculated in step (b) is less than 1, the sequencing error distribution of normal cells is shifted to the left.

(Deleted)

The method according to claim 1, wherein step (b) comprises calculating an ω value for each type of nucleotide substitution at a specific position in the sequencing error distribution data.

The method according to claim 5, wherein step (b) comprises: grouping the reads derived from normal cells or mutant cells used to obtain the sequencing error distribution data of step (a) according to position information; and
calculating an ω value for each type of nucleotide substitution from the sequencing error distribution data of reads, derived from normal cells or mutant cells, that contain the same position information.
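The grouping described in this claim can be sketched as below. The input format (tuples of read position, substitution type, and error rate), the 5 bp bin size, and the function names are assumptions made for illustration and are not taken from the claims.

```python
from collections import defaultdict
from statistics import mean

def group_by_position_and_type(observations, bin_size=5):
    """Group error rates by (position bin, substitution type).

    observations: iterable of (read_position, substitution, error_rate),
    with read_position 1-based.
    """
    groups = defaultdict(list)
    for pos, sub, rate in observations:
        groups[((pos - 1) // bin_size + 1, sub)].append(rate)
    return groups

def omega_by_group(normal_obs, mutant_obs, bin_size=5):
    """Compute omega per (position bin, substitution type) present in both data sets."""
    normal = group_by_position_and_type(normal_obs, bin_size)
    mutant = group_by_position_and_type(mutant_obs, bin_size)
    result = {}
    for key, rates in normal.items():
        normal_mean = mean(rates)
        # Skip groups missing from the mutant data or with a zero normal-cell mean.
        if key in mutant and normal_mean > 0:
            result[key] = mean(mutant[key]) / normal_mean
    return result

# Hypothetical observations (illustrative values only).
normal_obs = [(1, "C>T", 0.001), (3, "C>T", 0.003), (7, "C>T", 0.002), (2, "G>A", 0.001)]
mutant_obs = [(2, "C>T", 0.004), (8, "C>T", 0.002), (4, "G>A", 0.003)]

omegas = omega_by_group(normal_obs, mutant_obs)
```

Each resulting ω applies only to normal-cell error data from the same position bin and substitution type, which is how the grouped correction of the later claims would use it.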

(Deleted)

The method according to claim 1, wherein step (c) comprises obtaining corrected sequencing error distribution data by multiplying the sequencing error distribution data of the grouped normal cells by the ω value calculated according to claim 6.

The method according to claim 1, wherein in step (d), the mutation obtained from the nucleic acid sequence analysis data in the sample to be analyzed is a nucleotide substitution mutation.

The method according to claim 9, wherein step (d) comprises: classifying the nucleic acid sequencing data of the analysis target sample according to position information; and
evaluating the statistical significance between the mutation probability distribution value obtained from the classified nucleic acid sequencing data and the corrected sequencing error distribution data obtained according to claim 8,
wherein the classified nucleic acid sequencing data and the corrected sequencing error distribution data are obtained from reads containing the same position information.

The method according to claim 1, further comprising determining a mutation to be a true-positive mutation when there is a statistically significant difference between the corrected sequencing error distribution data and the mutation probability distribution value obtained from the nucleic acid sequencing data of the analysis target sample aligned to the human reference genome sequence.

A computer-readable recording medium containing a computer program for performing the method of any one of claims 1, 2, 5, 6, and 8 to 11.

An apparatus for determining a true-positive mutation, comprising:
a data collection unit for independently acquiring sequencing error distribution data of normal cells and sequencing error distribution data of mutant cells, for nucleic acids derived from normal cells or mutant cells that do not contain somatic and germline mutations;
a data analysis unit for calculating, from the data, an ω value that is the ratio of the mean of the sequencing error distribution of mutant cells to the mean of the sequencing error distribution of normal cells;
a data correction unit for obtaining corrected sequencing error distribution data by applying the calculated ω value as a weight to the sequencing error distribution data of normal cells; and
a data application unit for evaluating the statistical significance between the corrected sequencing error distribution data and a mutation probability distribution value obtained from nucleic acid sequencing data of an analysis target sample aligned to the human reference genome sequence,
wherein the data correction unit obtains the corrected sequencing error distribution data by multiplying the sequencing error distribution data of normal cells by the ω value calculated by the data analysis unit.

The apparatus of claim 13, wherein the data analysis unit calculates an ω value for each type of nucleotide substitution.

The apparatus of claim 13, wherein the data analysis unit calculates an ω value for each type of nucleotide substitution at a specific position in the sequencing error distribution data.

The apparatus of claim 15, wherein the data analysis unit groups the reads derived from normal cells or mutant cells used by the data collection unit to obtain the sequencing error distribution data according to position information, and
calculates an ω value for each type of nucleotide substitution from the sequencing error distribution data of reads, derived from normal cells or mutant cells, that contain the same position information.

The apparatus of claim 13, wherein the data correction unit shifts the sequencing error distribution of normal cells to the right when the ω value calculated by the data analysis unit is greater than 1, and shifts the sequencing error distribution of normal cells to the left when the ω value calculated by the data analysis unit is less than 1.

The apparatus of claim 17, wherein the data correction unit obtains corrected sequencing error distribution data by multiplying the sequencing error distribution data of the grouped normal cells by the ω value calculated according to claim 16.

The apparatus of claim 13, wherein, in the data application unit, the mutation obtained from the nucleic acid sequencing data of the sample to be analyzed is a nucleotide substitution mutation.

The apparatus of claim 19, wherein the data application unit classifies the nucleic acid sequencing data of the analysis target sample according to position information, and
evaluates the statistical significance between the mutation probability distribution value obtained from the classified nucleic acid sequencing data and the corrected sequencing error distribution data obtained according to claim 18,
wherein the classified nucleic acid sequencing data and the corrected sequencing error distribution data are obtained from reads containing the same position information.

The apparatus of claim 13, further comprising a data determination unit that determines a mutation to be a true-positive mutation when the difference between the corrected sequencing error distribution data and the mutation probability distribution value obtained from the nucleic acid sequencing data of the analysis target sample aligned to the human reference genome sequence is statistically significant.