KR102482668B1

KR102482668B1 - A method for improving the labeling accuracy of Unique Molecular Identifiers

Info

Publication number: KR102482668B1
Application number: KR1020200029799A
Authority: KR
Inventors: 정종석; 박웅양
Original assignee: 사회복지법인 삼성생명공익재단
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2022-12-29
Also published as: KR20210114279A

Abstract

The present disclosure relates to a method for improving the labeling accuracy of a unique molecular identifier, and to a method for manufacturing an adapter for nucleic acid sequence analysis in which the labeling accuracy of the unique molecular identifier is improved.

Description

A method for improving the labeling accuracy of Unique Molecular Identifiers

A genome is the entirety of the genetic information carried by an organism. Various technologies, such as DNA chips, Next Generation Sequencing (NGS), and Next Next Generation Sequencing (NNGS), are being developed for sequencing the genome of an individual. NGS is widely used for research and diagnostic purposes. Although the details vary with the type of instrument, NGS can be broadly divided into three steps: sample collection, library preparation, and nucleic acid sequencing. After nucleic acid sequencing, the presence of genetic variants is detected on the basis of the sequencing data produced.

To analyze multiple samples at the same time, the samples may be mixed and loaded into a single nucleic acid sequencing instrument. In this case, each specimen to be mixed must carry a label that identifies the sample it came from before mixing. Such labels can introduce errors into the sequencing results, for example through errors made by the polymerase during the polymerase chain reaction and/or detection errors during sequencing, and these errors hinder the detection of variants. There is therefore a need for a method in which a label capable of distinguishing multiple samples binds correctly to the sample to be analyzed and labels that sample accurately.

Korean Patent No. 1977976

An object of the present invention is to provide a method for improving the labeling accuracy of a unique molecular identifier, and a method for manufacturing an adapter for nucleic acid sequence analysis in which the labeling accuracy of the unique molecular identifier is improved.

In one aspect, there is provided a method for improving the labeling accuracy of a unique molecular identifier, the method comprising ligating, to a nucleic acid to be analyzed, an adapter for nucleic acid sequence analysis that comprises a unique molecular identifier (UMI) consisting of an oligonucleotide sequence in which each of the two ends is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C).

The unique molecular identifier, which may also be referred to as a unique molecular index, is a sequence of nucleotides applied to or identified in a nucleic acid molecule that can be used to distinguish individual nucleic acid molecules from one another. A unique molecular identifier can be sequenced together with the nucleic acid molecule to which it is linked in order to determine whether the sequences being analyzed derive from one source nucleic acid molecule or from another. A unique molecular identifier is similar to a barcode, which is typically used to distinguish the analysis of one sample from that of another sample; a unique molecular identifier, however, is used to distinguish one source nucleic acid molecule from the others when many nucleic acid molecules are sequenced together.

In the unique molecular identifier, each of the two ends of the constituent oligonucleotide sequence is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C); in other words, neither end is guanine (G). As can be seen in FIG. 1, when labeling was performed with conventionally known unique molecular identifiers and mislabeling occurred, the proportion of guanine in the mislabeled oligonucleotides was found to be high. Guanine is therefore removed from both ends with the aim of significantly increasing labeling accuracy.

The adapter may have a conventional adapter structure used in the art for nucleic acid sequence analysis and, specifically, may take the form of the adapter shown in FIG. 2. More specifically, the adapter may have the structure of an Illumina adapter into which a unique molecular identifier has been introduced; it may be partially double-stranded and formed by annealing two oligonucleotides corresponding to the two strands. The two strands may have a short stretch of complementary base pairs (for example, 12 to 17 bp) at the end to be ligated to the nucleic acid fragment to be sequenced, which allows the two oligonucleotides to anneal. The remaining bases of the two strands are not complementary, which results in a fork-shaped adapter with two overhanging arms.

The unique molecular identifier may be incorporated into the adapter for nucleic acid sequence analysis by methods widely known in the art. For example, it may be physically linked or joined by ligation or transposition using a polymerase, an endonuclease, a transposase, or the like.

The adapter may comprise a unique molecular identifier consisting of an oligonucleotide sequence in which every nucleotide is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C). In this case, the oligonucleotide constituting the unique molecular identifier contains no guanine at all, which may offer the advantage of significantly reducing the absolute amount of mislabeling of the unique molecular identifier.
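As an illustrative sketch that is not part of the original disclosure, the number of distinct UMI sequences available under each design can be counted by expanding a degenerate pattern, writing H for A, C, or T and N for any of the four bases (IUPAC codes). The function name `expand_umi_pattern` and the 4-nucleotide patterns are assumptions chosen for illustration.

```python
from itertools import product

# IUPAC degenerate codes used here: N = any base, H = any base except G.
IUPAC = {"N": "ACGT", "H": "ACT"}


def expand_umi_pattern(pattern: str) -> list[str]:
    """Return every concrete UMI sequence matching a degenerate pattern."""
    choices = [IUPAC.get(symbol, symbol) for symbol in pattern]
    return ["".join(bases) for bases in product(*choices)]


for pattern in ("NNNN", "HNNH", "HHHH"):
    umis = expand_umi_pattern(pattern)
    print(pattern, len(umis))
```

Excluding guanine narrows the pool of 4-nucleotide UMIs from 256 (NNNN) to 144 (no G at either end) or 81 (no G anywhere), so the gain in labeling accuracy is traded against UMI diversity.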

The unique molecular identifier may be 2 to 40, 2 to 35, 2 to 30, 2 to 25, 2 to 20, 2 to 15, 2 to 10, 2 to 6, 2 to 5, or 3 to 5 nucleotides in length. To secure an appropriate length and thereby improve labeling accuracy, it is preferably 2 to 6 or 3 to 5 nucleotides in length, but it is not necessarily limited thereto.

The step of ligating the adapter comprising the unique molecular identifier to the nucleic acid to be analyzed may include setting the concentration of the adapter relative to the concentration of the library containing the nucleic acid to be analyzed. Setting the adapter concentration relative to the library concentration makes it possible to control the mislabeling rate of the unique molecular identifier, or to lower it even further. Specifically, the adapter concentration may be set to 10² to 10⁵, 10³ to 10⁵, 10⁴ to 10⁵, 10² to 10⁴, or 10³ to 10⁴ times the library concentration and, according to a preferred embodiment, to 10³ to 10⁵ times.
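A minimal sketch of the ratio check implied by this step is given below. It assumes both concentrations are expressed in the same molar unit; the function names and the example values (500 nM adapter, 0.5 nM library) are hypothetical and do not come from the disclosure.

```python
def adapter_fold_excess(adapter_conc: float, library_conc: float) -> float:
    """Return the adapter concentration as a multiple of the library concentration."""
    if adapter_conc <= 0 or library_conc <= 0:
        raise ValueError("concentrations must be positive")
    return adapter_conc / library_conc


def in_preferred_range(fold: float, low: float = 1e3, high: float = 1e5) -> bool:
    """Check the fold excess against the preferred 10^3 to 10^5 range."""
    return low <= fold <= high


# Hypothetical example: 500 nM adapter against a 0.5 nM library.
fold = adapter_fold_excess(500.0, 0.5)
print(fold, in_preferred_range(fold))
```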

The method may further include, after the step of ligating the adapter comprising the unique molecular identifier to the nucleic acid sample to be analyzed, a purification step of removing residual adapter. In this case, stringency can be raised by increasing the proportion of purified library, which is longer than the adapter, and the mislabeling rate of the unique molecular identifier can be lowered even further. The conditions of the purification step may be set appropriately by the practitioner in view of the library and adapter concentrations. As a specific example, in purification using magnetic beads (for example, Solid Phase Reversible Immobilization (SPRI) magnetic beads), the volume ratio of the magnetic beads may be set lower than 1.8X, or the purification may be performed 1 to 5 times, 1 to 4 times, 1 to 3 times, or 2 to 4 times. In a specific example, the mislabeling rate of the unique molecular identifier was confirmed to be significantly lower when purification was performed twice than when it was performed once.
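The bead volume ratio mentioned here can be turned into a pipetting volume with simple arithmetic, sketched below. The helper name `spri_bead_volume_ul` and the 100 µl sample volume are assumptions made for illustration only.

```python
def spri_bead_volume_ul(sample_volume_ul: float, bead_ratio: float = 1.8) -> float:
    """Volume of SPRI bead suspension for a given sample volume and bead-to-sample ratio."""
    if sample_volume_ul <= 0 or bead_ratio <= 0:
        raise ValueError("sample volume and bead ratio must be positive")
    return sample_volume_ul * bead_ratio


# Hypothetical 100 ul sample at the 1.8X ratio and at a lower, more stringent ratio.
for ratio in (1.8, 1.0):
    print(f"{ratio}X beads: {spri_bead_volume_ul(100.0, ratio):.0f} ul")
```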

The nucleic acid sequence analysis may be any analysis method that uses an adapter comprising a unique molecular identifier and is not particularly limited; a specific example is Next Generation Sequencing.

In another aspect, there is provided a method for manufacturing an adapter for nucleic acid sequence analysis with improved labeling accuracy of a unique molecular identifier, the method comprising incorporating into the adapter a unique molecular identifier consisting of an oligonucleotide sequence in which each of the two ends is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C).

The method may include incorporating into the adapter for nucleic acid sequence analysis a unique molecular identifier consisting of an oligonucleotide sequence in which every nucleotide is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C).

The details of the unique molecular identifier, its length, the method of incorporating it into the adapter, the nucleic acid sequence analysis, and so on are as described above.

Because the adapter for nucleic acid sequence analysis manufactured by the above method comprises a unique molecular identifier consisting of an oligonucleotide designed to have a specific sequence, it has the advantage of a significantly higher labeling performance and a significantly lower mislabeling rate in sequence analysis than the prior art.

The method for improving the labeling accuracy of a unique molecular identifier provided in one aspect can significantly improve labeling accuracy because the oligonucleotide constituting the unique molecular identifier contains no guanine (G) at either end. As a result, duplicate reads can be removed using the unique molecular identifier, and the efficiency of recovering unique molecules, which is the original purpose of the unique molecular identifier, can be significantly improved.
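The duplicate-read removal mentioned above can be illustrated with a minimal sketch. It assumes each read has already been reduced to a hypothetical tuple of chromosome, alignment start, UMI sequence, and read name, and it keeps the first read seen for each (chromosome, start, UMI) combination. Production pipelines typically also tolerate sequencing errors within the UMI, which this sketch does not attempt.

```python
def deduplicate_reads(reads):
    """Keep one read name per (chromosome, start, UMI) key, in first-seen order.

    `reads` is an iterable of (chrom, start, umi, read_name) tuples; this
    record layout is an assumption made for illustration.
    """
    first_seen = {}
    for chrom, start, umi, read_name in reads:
        # setdefault stores the read only if the key has not been seen before.
        first_seen.setdefault((chrom, start, umi), read_name)
    return list(first_seen.values())


reads = [
    ("chr12", 25380168, "ACTA", "read1"),
    ("chr12", 25380168, "ACTA", "read2"),  # same position and UMI: PCR duplicate
    ("chr12", 25380168, "CATT", "read3"),  # same position, different UMI: kept
]
print(deduplicate_reads(reads))
```

A mislabeled UMI breaks this grouping in either direction: it can split true duplicates into separate keys or merge distinct molecules into one, which is why labeling accuracy matters for deduplication.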

FIG. 1 shows the guanine content of the oligonucleotide sequences constituting the unique molecular identifiers in cases where the unique molecular identifier was mislabeled.
FIG. 2 is a schematic diagram of adapters with improved labeling accuracy of the unique molecular identifier provided in one aspect, in which (1) shows the case where there is no guanine anywhere (HHHH), (2) shows the case where there is no guanine at either end (HNNH), and (3) shows the case where any of the four nucleotides, including guanine, may be present at both ends (NNNN) (H: one of the three nucleotides other than guanine; N: one of the four nucleotides including guanine).
FIGS. 3 and 4 show the results of confirming labeling accuracy after labeling the nucleic acid samples to be sequenced with each of the adapters of FIG. 2.

Hereinafter, examples are described in detail to explain the invention more specifically.

Example 1. Preparation of adapters containing a unique molecular identifier

The annealing buffer was prepared by mixing 9.9 ml of 1X TE (10 mM Tris pH 8.0, 1 mM EDTA) with 0.1 ml of 5 M NaCl. The adapter containing the unique molecular identifier was prepared by mixing 50 μl of the annealing buffer, 25 μl of 100 μM universal adapter oligonucleotide (universal adaptor oligo), and 25 μl of 100 μM adapter oligonucleotide with an embedded unique molecular identifier (UMI embedded adaptor oligo). The adapter solution was annealed in a thermocycler at 95 °C for 1 minute, followed by a temperature gradient of -0.1 °C/second for 800 seconds, ending at 14 °C. The sequences of the adapters finally prepared are shown in Table 1 below.

Table 1
Adapter            Sequence
UMI:NNNN           /5Phos/GAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACGNNNNATCTCGTATGCCGTCTTCTGCTTG
UMI:HNNH           /5Phos/GAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACGNHHNATCTCGTATGCCGTCTTCTGCTTG
UMI:HHHH           /5Phos/GAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCACGHHHHATCTCGTATGCCGTCTTCTGCTTG
Universal adapter  AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTGACT
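The mixing volumes and thermocycler settings given in Example 1 imply a few derived quantities, worked out in the sketch below. The helper `diluted_concentration` is a hypothetical name, and the calculation assumes volumes are simply additive.

```python
def diluted_concentration(stock_conc: float, stock_volume: float, total_volume: float) -> float:
    """Concentration after diluting `stock_volume` of stock into `total_volume`."""
    return stock_conc * stock_volume / total_volume


# Annealing buffer: 0.1 ml of 5 M (5000 mM) NaCl in 9.9 ml + 0.1 ml total.
nacl_mM = diluted_concentration(5000.0, 0.1, 9.9 + 0.1)

# Adapter mix: 25 ul of each 100 uM oligo in 50 + 25 + 25 ul total.
oligo_uM = diluted_concentration(100.0, 25.0, 50.0 + 25.0 + 25.0)

# Thermocycler ramp: start at 95 C and cool at 0.1 C per second for 800 seconds.
ramp_end_c = 95.0 - 0.1 * 800

print(f"NaCl in annealing buffer: {nacl_mM:.0f} mM")
print(f"Each oligo in adapter mix: {oligo_uM:.0f} uM")
print(f"Temperature at end of ramp: {ramp_end_c:.0f} C")
```

Under these assumptions each oligonucleotide ends up at 25 μM, which agrees with the 25 μM adapter stock used in Example 2, and the ramp finishes at about 15 °C, just above the 14 °C end point.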

Example 2. Analysis of labeling accuracy according to changes in the unique molecular identifier sequence

1. Experimental method

Using 200 ng of cell-line genomic DNA (NA12878), a library adenylated at both 3' ends and the adapter containing the unique molecular identifier (25 μM stock) were combined into the reaction mixture shown in Table 2 below, in accordance with the manufacturer's reaction conditions (library preparation kit: KAPA hyper library prep kit for Illumina). The reaction was carried out at 4 °C overnight, after which purification was performed using SPRI beads.

Table 2
Component                   Volume (μl)
A-tailing reaction product  60
Adapter stock               5
PCR-grade water             5
Ligation Buffer             30
DNA Ligase                  10
Total volume                110

The synthetic nucleic acid fragments (artificial fragments) used as the nucleic acids to be analyzed are shown in Table 3 below. Every synthetic fragment was designed with CTTC at the 5' end and GAAG at the 3' end of the portion matching the human genome, so that it could later be extracted from the raw data (FASTQ). For extraction, reads containing the same 50 bp sequence, counted from the 5' end of the fragment, were extracted from the FASTQ file.

Table 3
No.  Reference position                                         Artificial fragment sequence
1    KRAS: Chr12; Exon number: 3; chr12:25380168-25380346       CTTCATCCTGAGAAGGGAGAAACACAGTCTGGATTATTACAGTGCACCTTTTACTTCAAAAAAGGTGTTATATACAACTCAACAACAAAAAATTCAATTTAAAAATGGGCAAAGGACTTGAAAAGACATTGTTCCTGCTCCAAAGAGAAG
2    IDH1: Chr2; Exon number: 4; chr2:209113048-209113359       CTTCAATGGCTTCTCTGAAGACCGTGCCACCCAGAATATTTCGTATGGTGCCATTTGGTGATTTCCACATTTGTTTCAACTTGAACTCCTCAACCCTCTTCTCATCAGGAGTGATAGTGGCACATTTGACGCCAACATTATGCTTCGAAG
3    BRCA1: Chr17; Exon number: 15; chr17:41222945-41223255     CTTCTTCTGGCTTCTCCCTGCTCACACTTTCTTCCATTGCATTATACCCAGCAGTATCAGTAGTATGAGCAGCAGCTGGACTCTGGGCAGATTCTGCAACTTTCAACTTTCAATTGGGGAACTTTCAATGCAGAGGTTGAAGATGGGAAG
4    ALK: Chr2; Exon number: 20; chr2:29446208-29446394         CTTCACTGATGGAGGAGGTCTTGCCAGCAAAGCAGTAGTTGGGGTTGTAGTCGGTCATGATGGTCGAGGTGCGGAGCTTGCTCAGCTTGTACTCAGGGCTCTGCAGCTCCATCTGCATGGCTTGCAGCTCCTGGTGCTTCCGGCGGGAAG
5    ERBB2: Chr17; Exon number: 6; chr17:37864574-37864787      CTTCGCTACGTGCTCATCGCTCACAACCAAGTGAGGCAGGTCCCACTGCAGAGGCTGCGGATTGTGCGAGGCACCCAGCTCTTTGAGGACAACTATGCCCTGGCCGTGCTAGACAATGGAGACCCGCTGAACAATACCACCCCTGTGAAG
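The extraction of fragment-derived reads from the FASTQ file, described in the experimental method, could be sketched as follows. This is an assumption-laden illustration rather than the procedure actually used: it treats the FASTQ file as plain four-line records, matches the first 50 bp of a fragment anywhere in the read, and ignores reverse-complement reads. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def parse_fastq(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    rows = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        # rows[i + 2] is the '+' separator line and is skipped.
        yield rows[i], rows[i + 1], rows[i + 3]


def extract_fragment_reads(lines: Iterable[str], fragment: str, prefix_len: int = 50):
    """Return the records whose sequence contains the fragment's first `prefix_len` bases."""
    if len(fragment) < prefix_len:
        raise ValueError("fragment is shorter than the requested prefix length")
    key = fragment[:prefix_len]
    return [record for record in parse_fastq(lines) if key in record[1]]
```

A typical call would pass an open file handle, for example `extract_fragment_reads(open("reads.fastq"), kras_fragment)`, where the file name and the variable holding a Table 3 sequence are placeholders.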

2. Experimental results

Across the adapter ligation conditions and library purification conditions, the 'HHHH' UMI adapter, from which 'G' had been removed entirely, showed no 'G' in the mislabeled UMIs, and with the UMI adapter from which 'G' had been removed only at both ends, the proportion of 'G' was lower than with the NNNN UMI. In addition, the number of mislabeled UMI reads was observed to vary in correlation with the adapter ligation conditions and library purification conditions (FIGS. 3 and 4).
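The proportion of 'G' in mislabeled UMIs can be tabulated per position, as in the sketch below. How a UMI is judged to be mislabeled is not spelled out here, so the sketch simply assumes a list of already identified mislabeled UMI sequences of equal length; the example sequences are invented for illustration.

```python
def guanine_fraction_by_position(umis: list[str]) -> list[float]:
    """Fraction of UMIs carrying G at each position (all UMIs must share one length)."""
    if not umis:
        return []
    length = len(umis[0])
    if any(len(umi) != length for umi in umis):
        raise ValueError("all UMIs must have the same length")
    return [sum(umi[i] == "G" for umi in umis) / len(umis) for i in range(length)]


# Invented example: four mislabeled 4-nt UMIs.
mislabeled = ["GACG", "TACG", "GGCA", "ACTA"]
print(guanine_fraction_by_position(mislabeled))
```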

Looking at the final proportion of mislabeled reads by condition, the proportion of correctly labeled UMI sequences was highest when the 'HNNH' UMI (with 'G' removed only at both ends) was used, ligation was performed at 4 °C overnight with an adapter concentration of 1/10000 times the library concentration (final adapter concentration: 136 nM), and purification after ligation was performed twice with SPRI (Solid Phase Reversible Immobilization) beads at 1.8X the volume of the solution to be purified.

<110> SAMSUNG LIFE PUBLIC WELFARE FOUNDATION
<120> A method for improving the labeling accuracy of Unique Molecular Identifiers
<130> PN129687KR
<160> 9
<170> KoPatentIn 3.0
<210> 1 <211> 69 <212> DNA <213> Artificial Sequence <220> <223> UMI:NNNN <400> 1 gagagatcgg aagagcacac gtctgaactc cagtcaccac gnnnnatctc gtatgccgtc 60 ttctgcttg 69
<210> 2 <211> 69 <212> DNA <213> Artificial Sequence <220> <223> UMI:HNNH <400> 2 gagagatcgg aagagcacac gtctgaactc cagtcaccac gnhhnatctc gtatgccgtc 60 ttctgcttg 69
<210> 3 <211> 69 <212> DNA <213> Artificial Sequence <220> <223> UMI:HHHH <400> 3 gagagatcgg aagagcacac gtctgaactc cagtcaccac ghhhhatctc gtatgccgtc 60 ttctgcttg 69
<210> 4 <211> 62 <212> DNA <213> Artificial Sequence <220> <223> Universal Adaptor <400> 4 aatgatacgg cgaccaccga gatctacact ctttccctac acgacgctct tccgatctga 60 ct 62
<210> 5 <211> 150 <212> DNA <213> Artificial Sequence <220> <223> KRAS chr12:25380168-25380346 <400> 5 cttcatcctg agaagggaga aacacagtct ggattattac agtgcacctt ttacttcaaa 60 aaaggtgtta tatacaactc aacaacaaaa aattcaattt aaaaatgggc aaaggacttg 120 aaaagacatt gttcctgctc caaagagaag 150
<210> 6 <211> 150 <212> DNA <213> Artificial Sequence <220> <223> IDH1 chr2:209113048-209113359 <400> 6 cttcaatggc ttctctgaag accgtgccac ccagaatatt tcgtatggtg ccatttggtg 60 atttccacat ttgtttcaac ttgaactcct caaccctctt ctcatcagga gtgatagtgg 120 cacatttgac gccaacatta tgcttcgaag 150
<210> 7 <211> 150 <212> DNA <213> Artificial Sequence <220> <223> BRAC1 chr17:41222945-41223255 <400> 7 cttcttctgg cttctccctg ctcacacttt cttccattgc attataccca gcagtatcag 60 tagtatgagc agcagctgga ctctgggcag attctgcaac tttcaacttt caattgggga 120 actttcaatg cagaggttga agatgggaag 150
<210> 8 <211> 150 <212> DNA <213> Artificial Sequence <220> <223> ALK chr2:29446208-29446394 <400> 8 cttcactgat ggaggaggtc ttgccagcaa agcagtagtt ggggttgtag tcggtcatga 60 tggtcgaggt gcggagcttg ctcagcttgt actcagggct ctgcagctcc atctgcatgg 120 cttgcagctc ctggtgcttc cggcgggaag 150
<210> 9 <211> 150 <212> DNA <213> Artificial Sequence <220> <223> ERBB2 chr17:37864574-37864787 <400> 9 cttcgctacg tgctcatcgc tcacaaccaa gtgaggcagg tcccactgca gaggctgcgg 60 attgtgcgag gcacccagct ctttgaggac aactatgccc tggccgtgct agacaatgga 120 gacccgctga acaataccac ccctgtgaag 150

Claims

1. A method for improving the labeling accuracy of a unique molecular identifier, comprising ligating, to a nucleic acid to be analyzed, an adapter for nucleic acid sequence analysis comprising a unique molecular identifier consisting of an oligonucleotide sequence in which each of the two ends is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C), wherein the unique molecular identifier is 2 to 6 nucleotides in length.

2. The method according to claim 1, wherein the adapter comprises a unique molecular identifier consisting of an oligonucleotide sequence in which every nucleotide is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C).

3. (Deleted)

4. The method according to claim 1, wherein the ligating comprises setting the concentration of the adapter to 10³ to 10⁵ times the concentration of the nucleic acid to be analyzed and then ligating.

5. The method according to claim 1, further comprising, after the ligating step, performing a purification process using magnetic beads 1 to 5 times.

6. The method according to claim 1, wherein the analysis is Next Generation Sequencing.

7. A method for preparing an adapter for nucleic acid sequence analysis with improved labeling accuracy of a unique molecular identifier, comprising incorporating, into an adapter for nucleic acid sequence analysis, a unique molecular identifier consisting of an oligonucleotide sequence in which each of the two ends is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C),
wherein the unique molecular identifier is 2 to 6 nucleotides in length.

8. The method according to claim 7, comprising incorporating, into the adapter for nucleic acid sequence analysis, a unique molecular identifier consisting of an oligonucleotide sequence in which every nucleotide is any one selected from the group consisting of adenosine (A), thymine (T), and cytosine (C).

9. (Deleted)

10. The method according to claim 7, wherein the analysis is Next Generation Sequencing.