KR20230039218A

KR20230039218A - Method of displaying paired-end-read merge for next generation sequencing

Info

Publication number: KR20230039218A
Application number: KR1020210122244A
Authority: KR
Inventors: 정준혁; 박승구; 서성현
Original assignee: (주)디엑솜
Priority date: 2021-09-14
Filing date: 2021-09-14
Publication date: 2023-03-21
Also published as: WO2023043097A1

Abstract

The present application relates to a method for merging and displaying paired-end reads for next-generation sequencing analysis. Because the two reads of each pair are merged into one sequence before further use, sequencing data processing time can be reduced by half. In sequence-alignment visualization, what was previously drawn as two sequences per pair is reduced to one, which reduces data storage space. Target-region coverage and sequencing errors can be judged from the merged sequence information alone, without additional information, so next-generation sequencing analysis can be performed more efficiently.

Description

METHOD OF DISPLAYING PAIRED-END-READ MERGE FOR NEXT GENERATION SEQUENCING

The present application relates to a method for merging and displaying paired-end reads for next-generation sequencing analysis.

A wide range of biological information is encoded in the genes of a DNA sequence, and an individual's complete DNA sequence is very important for understanding life phenomena and obtaining disease-related information. The core aims of decoding DNA sequence information, that is, genome sequencing, are to identify individual differences and ethnic characteristics, to determine congenital causes, including chromosomal abnormalities, of diseases associated with genetic abnormalities, and to find the genetic defects behind complex diseases such as diabetes and hypertension. Sequencing data is also very important because information on gene expression, genetic diversity, and their interactions can be widely used in molecular diagnosis and treatment.

Next-generation sequencing (NGS) has been used for genome sequencing since 2007, and its development has made analysis much easier and cheaper than with traditional methods. Representative next-generation sequencers that implement NGS include Roche/454, Illumina/Solexa, and SOLiD from Life Technologies (ABI). These instruments can read more than 80 million sequences in 7 hours. Thanks to these advances, next-generation sequencing, which was previously used only for research because of its enormous test cost, can now be used in clinical testing.

Meanwhile, to examine a particular gene region, the DNA or RNA of the region to be analyzed must first be selected. This is called target enrichment, and NGS analysis performed in this way is called targeted sequencing (target panel sequencing). Target enrichment is divided into the amplicon method, which amplifies with PCR primers, and the capture method, which hybridizes with probes. The PCR amplicon method takes less time and requires a relatively small amount of DNA, so it is useful for testing small, well-designed gene panels. When the number of genes in the panel grows, or when exome sequencing is performed, the probe-based method is more advantageous.

Research into performing next-generation sequencing more efficiently has produced, for example, a microwave-based DNA extraction method for next-generation sequencing and its use (Korean Registered Patent No. 10-2177386). Nevertheless, further development and research on more efficient ways to perform next-generation sequencing are still needed.

Accordingly, the present inventors developed a method for merging and displaying paired-end reads for next-generation sequencing analysis, thereby completing the present invention.

Korean Registered Patent Publication No. 10-2177386

The present application relates to a method for merging and displaying paired-end reads for next-generation sequencing analysis.

However, the problems to be solved by the present application are not limited to the one mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the description below.

A first aspect of the present application provides a method for merging and displaying paired-end reads for next-generation sequencing analysis.

A second aspect of the present application provides a method for interpreting the merged display of paired-end reads for next-generation sequencing analysis.

Because the present application merges the two sequences before they are used, sequencing data processing time can be reduced by half. Because what was drawn as two sequences per pair in sequence-alignment visualization is reduced to one, data storage space can be reduced. And because target-region coverage and sequencing errors can be judged from the merged sequence information alone, without additional information, next-generation sequencing analysis can be performed more efficiently.

FIG. 1 is a diagram showing the overlap of paired-end reads seen in targeted sequencing.
FIG. 2 is a diagram showing how the reads are merged when the overlap of the R1 and R2 read sequences against the target reference sequence matches perfectly.
FIG. 3 is a diagram showing the merged display when the R1 or R2 read sequence information against the target reference sequence is unknown.
FIG. 4 is a diagram showing the merged display when the bases in the overlapping portion of the two paired-end reads differ.
FIG. 5 is a diagram showing the merged display when the two paired-end reads do not overlap.
FIG. 6 is a diagram showing an example of merged display according to the method of the present application (a case with many lowercase n characters).
FIG. 7 is a diagram showing an example of merged display according to the method of the present application (a case with many bases shown in lowercase).
FIG. 8 is a diagram comparing sequence-alignment visualization by the conventional method and by the method of the present application.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described here. Parts of the drawings that are irrelevant to the description are omitted for clarity, and similar reference numerals are used for similar parts throughout the specification.

Throughout this specification, when a member is said to be located "on" another member, this includes not only the case where the member is in contact with the other member but also the case where a further member is present between the two.

Throughout this specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. The terms of degree "about", "substantially", and the like are used to mean at, or close to, the stated value when manufacturing and material tolerances inherent in the stated meaning are presented, and they are used to prevent unscrupulous infringers from unfairly exploiting disclosures in which exact or absolute values are stated to aid understanding of the present application. The terms "step of (doing)" and "step of" as used throughout this specification do not mean "step for".

Throughout this specification, the term "combination(s) thereof" in a Markush-type expression means one or more mixtures or combinations selected from the group consisting of the components recited in that expression, that is, it means including one or more selected from that group.

Throughout this specification, "A and/or B" means "A or B, or A and B".

Hereinafter, implementations and examples of the present application are described in detail with reference to the accompanying drawings. The present application is not, however, limited to these implementations, examples, and drawings.

A first aspect of the present application provides a method for merging and displaying paired-end reads for next-generation sequencing analysis.

According to one implementation, the present application provides a method for merging and displaying paired-end reads for next-generation sequencing analysis, the method comprising:

(a) lining up the two paired-end reads and identifying their overlapping portion;

(b) merging the two reads on the basis of the overlapping portion; and

(c) displaying the merged read,

wherein, in the displaying step:

where the base information of a read sequence for the target is unknown in some part of the two sequences, the base of the reference sequence is displayed as it is, but in lowercase;

where the bases in the overlapping portion of the two sequences differ, the differing position is displayed as a lowercase n; and

where the two sequences have no bases in the overlapping portion (that is, the reads do not overlap), the bases of the reference sequence are displayed as they are, but in lowercase (see FIGS. 2 to 5).
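The three display rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the applicant's implementation: the function name `merge_pair`, the 0-based start offsets, and the treatment of an uppercase `N` in a read as "unknown" are all hypothetical. It also assumes that both reads are already aligned to the target reference without insertions or deletions, that R2 is already in reference orientation, and that the merged display spans the whole target reference.

```python
def _base_at(read, start, pos):
    """Return the read's base at reference position pos, or None if unknown."""
    offset = pos - start
    if 0 <= offset < len(read):
        base = read[offset].upper()
        # Assumption: an 'N' call in the read means the base is unknown.
        return None if base == "N" else base
    return None


def merge_pair(ref, r1, r1_start, r2, r2_start):
    """Merge two aligned reads into one display string over the reference.

    Uppercase  : base supported by at least one read
    lowercase  : base copied from the reference (no read information)
    'n'        : R1 and R2 overlap here but disagree
    """
    merged = []
    for pos, ref_base in enumerate(ref):
        b1 = _base_at(r1, r1_start, pos)
        b2 = _base_at(r2, r2_start, pos)
        if b1 is None and b2 is None:
            merged.append(ref_base.lower())   # unknown or no overlap
        elif b1 is None or b2 is None:
            merged.append(b1 or b2)           # only one read covers pos
        elif b1 == b2:
            merged.append(b1)                 # overlap agrees
        else:
            merged.append("n")                # overlap disagrees
    return "".join(merged)


ref = "ACGTACGTAC"
print(merge_pair(ref, "ACGTAC", 0, "CGTAC", 5))  # ACGTACGTAC
print(merge_pair(ref, "ACGTAC", 0, "GGTAC", 5))  # ACGTAnGTAC
print(merge_pair(ref, "ACG", 0, "TAC", 7))       # ACGtacgTAC
```

The three calls correspond, in order, to a perfectly matching overlap, a mismatch inside the overlap, and a pair with no overlap.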

As used throughout this specification, the term "paired-end read" means the reads (fragments) obtained by sequencing both ends of a cDNA fragment, in the forward and reverse directions.

As used throughout this specification, the term "read sequence" (or simply "read") means a single nucleic acid fragment analyzed by next-generation sequencing (NGS). Read length varies with the type of genome sequencer, generally from about 35 to 500 bp (base pairs), and for DNA the bases are generally written with the letters A, T, G, and C.

As used throughout this specification, the term "reference sequence (ref)" means the base sequence that serves as the reference for generating the full base sequence from the read sequences. In sequence analysis, the full sequence is completed by mapping the large number of reads output by a genome sequencer against the reference sequence. In the present invention, the reference sequence may be a sequence set in advance for the analysis (for example, the whole human genome sequence), or a base sequence produced by a genome sequencer may be used as the reference sequence.

As used throughout this specification, the term "base" is the smallest unit making up a reference sequence and a read sequence. For DNA, a sequence can be written with four letters, A, T, G, and C, each of which is called a base. In other words, DNA is expressed with four bases, and the same is true of read sequences.

According to one implementation of the present application, because the sequence that was split into two reads is merged into one, sequencing data processing time after sequence alignment is reduced by half, and because what was drawn as two lines per pair is reduced to one, data storage space is reduced, so next-generation sequencing analysis can be performed more efficiently.

A second aspect of the present application provides a method for interpreting the merged display of paired-end reads for next-generation sequencing analysis. Content that overlaps with the first aspect applies equally to the method of the second aspect.

According to one implementation, the present application provides a method for interpreting reads that have been merged and displayed according to the above method (see FIGS. 2 to 5).

In this interpretation method, the more lowercase bases (a, t, g, c) a merged read contains, the narrower the sequenced range of the target region is taken to be, and the more lowercase n characters it contains, the more sequencing errors it is taken to have.
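As a rough illustration of this reading rule, the two counts can be taken directly from the merged string. The function name `summarize_merged` and the returned field names are hypothetical, and reporting fractions alongside the raw counts is an added convenience rather than something the application specifies.

```python
def summarize_merged(merged):
    """Count the two signals carried by a merged display string.

    lowercase a/t/g/c -> positions filled in from the reference
                         (more of them = narrower target-region coverage)
    lowercase n       -> positions where R1 and R2 disagreed
                         (more of them = more sequencing errors)
    """
    uncovered = sum(1 for ch in merged if ch in "atgc")
    mismatches = merged.count("n")
    length = len(merged)
    return {
        "length": length,
        "uncovered": uncovered,
        "mismatches": mismatches,
        # Guard against an empty string to avoid division by zero.
        "uncovered_fraction": uncovered / length if length else 0.0,
        "mismatch_fraction": mismatches / length if length else 0.0,
    }


print(summarize_merged("ACGtacgTAC"))
print(summarize_merged("ACGTAnGTAC"))
```

No threshold is implied here; how many lowercase characters make a merged read untrustworthy is left to the analyst.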

According to one implementation of the present application, target-region coverage and sequencing errors can be judged from the merged sequence information alone, without additional information.

Hereinafter, the present invention is described in more detail through examples. The following examples are given only to aid understanding of the present application, and its content is not limited to them.

[Example]

Merging reads according to the target reference sequence

1. Combining the R1 and R2 sequences around the target reference sequence (when the overlapping sequences are identical)

The R1 and R2 read sequences were combined around their overlapping portion. As a result, a single sequence combining the R1 and R2 read sequences was obtained (see FIG. 2).
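One common way to locate the overlapping portion when the two reads are handled without alignment coordinates is to compare the end of R1 with the start of the reverse complement of R2. The sketch below shows that idea only; it is not taken from the application, which works against the target reference. The names `find_overlap` and `join_on_overlap`, the exact-match requirement, and the default `min_overlap` of 3 are assumptions made for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_overlap(r1, r2_rc, min_overlap=3):
    """Length of the longest suffix of r1 that equals a prefix of r2_rc.

    Returns 0 when no exact overlap of at least min_overlap bases exists.
    """
    max_len = min(len(r1), len(r2_rc))
    for n in range(max_len, max(min_overlap, 1) - 1, -1):
        if r1[-n:] == r2_rc[:n]:
            return n
    return 0


def join_on_overlap(r1, r2, min_overlap=3):
    """Join R1 and reverse-complemented R2 on an exact overlap, else None."""
    r1 = r1.upper()
    r2_rc = reverse_complement(r2)
    n = find_overlap(r1, r2_rc, min_overlap)
    if n == 0:
        return None
    return r1 + r2_rc[n:]


fragment = "ACGTTGCAAGGCT"
r1 = fragment[:8]                       # forward read from the 5' end
r2 = reverse_complement(fragment[5:])   # reverse read from the other end
print(join_on_overlap(r1, r2))          # ACGTTGCAAGGCT
```

Exact matching is the simplest choice and only covers the identical-overlap case described here; handling mismatches inside the overlap would need a tolerance that this sketch does not attempt.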

2. When read sequence information against the target reference sequence is unknown

Where part of the R1 or R2 read sequence was unknown, the target reference sequence for the unknown part was taken as it is, but written in lowercase, and the reads were combined (see FIG. 3).

3. When the overlapping bases against the target reference sequence differ

Where the overlapping portion of the R1 and R2 read sequences contained positions at which the two bases differed, those positions were written as a lowercase "n" and the reads were combined (see FIG. 4).

4. When the reads do not overlap on the target reference sequence

Where no sequence was present for the overlapping portion of the R1 and R2 read sequences, the target reference sequence for that portion was taken as it is, but written in lowercase, and the reads were combined (see FIG. 5).

Actual examples of R1 and R2 read sequences merged and displayed according to the methods in items 1 to 4 above are shown in FIGS. 6 and 7.

When sequencing analysis is carried out after the merged display described above, the sequence that was split into two reads has been merged into one, so sequencing data processing time after sequence alignment is reduced by half, and because what was drawn as two lines per pair is reduced to one, data storage space is reduced as well.

Interpreting results from the merged reads

Reads merged and displayed according to items 1 to 4 above can be interpreted as follows.

Number of lowercase a, t, g, c = target-region sequencing coverage (the more lowercase a, t, g, c, the narrower the sequenced range)

Number of lowercase n = degree of sequencing error (the more lowercase n, the more sequencing errors)

On this basis, the sample reads merged above were interpreted as follows.

As shown in FIG. 6, when the second line (the R1 read sequence) and the third line (the R2 read sequence) were combined and displayed as described above (fourth line), several positions were marked in lowercase (n) because some bases of the R1 and R2 read sequences differed. In this case, the sequencing quality of that read can be judged unreliable.

As shown in FIG. 7, when the second line (the R1 read sequence) and the third line (the R2 read sequence) were combined and displayed as described above (fourth line), many positions were shown in lowercase because the sequence information there was unknown. In this case, the read can be regarded as one from which it is difficult to assess the whole target region.

As the interpretation of the sample reads shows, when reads are merged and displayed using the present invention, the coverage of the target region and the sequencing errors of a sequence can be judged without any additional information or additional work.

In summary, because the present application merges the two sequences before they are used, sequencing data processing time can be reduced by half; because what was drawn as two lines per pair is reduced to one, data storage space is reduced; and target-region coverage and sequencing errors can be judged from the merged sequence information alone, without additional information.

The foregoing description is given for illustration, and those of ordinary skill in the art will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the present application. The embodiments described above should therefore be understood as illustrative in all respects and not limiting. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

Claims

A method for merging and displaying paired-end reads for next-generation sequencing analysis, the method comprising:
(a) lining up the two paired-end reads and identifying their overlapping portion;
(b) merging the two reads on the basis of the overlapping portion; and
(c) displaying the merged read,
wherein, in the displaying step,
where the base information of a read sequence for the target is unknown in some part of the two sequences, the base of the reference sequence is displayed as it is, but in lowercase,
where the bases in the overlapping portion of the two sequences differ, the differing position is displayed as a lowercase n, and
where the two sequences have no bases in the overlapping portion, the bases of the reference sequence are displayed as they are, but in lowercase.

A method for interpreting the merged display of paired-end reads for next-generation sequencing analysis, wherein reads merged and displayed according to claim 1 are interpreted.

The method according to claim 2,
wherein the more bases are shown in lowercase in the merged read, the narrower the sequenced range of the target region is interpreted to be.

The method according to claim 2,
wherein the more lowercase n characters the merged read contains, the more sequencing errors it is interpreted to have.