CN107354209B

CN107354209B - Combinatorial tags, linkers and methods for determining nucleic acid sequences containing low frequency mutations

Info

Publication number: CN107354209B
Application number: CN201710573056.XA
Authority: CN
Inventors: 高晓峘; 曾晓静; 李胜; 张印新; 韩颖鑫; 何哲; 王佳伟; 夏伟成; 蒋馥蔓
Original assignee: Guangzhou Jingke Medical Laboratory Co ltd
Current assignee: Guangzhou Jingke Medical Laboratory Co ltd
Priority date: 2017-07-14
Filing date: 2017-07-14
Publication date: 2021-01-08
Anticipated expiration: 2037-07-14
Also published as: WO2019010776A1; CN107354209A

Abstract

The invention provides a combined tag, an adaptor containing the combined tag and a composition thereof, and a method for determining a nucleic acid sequence containing a low-frequency mutation in a target region of a sample to be tested. The combined tag comprises a molecular tag and a library tag, with the bases of the molecular tag interleaved with the bases of the library tag. The invention combines the library tag with the random molecular tag and uses the defined base sequence of the library tag, which identifies different samples, to separate the random bases of the molecular tag. This limits the number of consecutive identical bases without reducing the diversity of the molecular tag, without adding to the length of either tag, and without wasting sequencing data.

Description

Combinatorial tags, linkers and methods for determining nucleic acid sequences containing low frequency mutations

Technical Field

The invention relates to the technical field of nucleic acid sequencing, and in particular to a combined tag, an adaptor containing the combined tag and a composition thereof, and a method for determining a nucleic acid sequence containing a low-frequency mutation in a target region of a sample to be tested.

Background

High-throughput sequencing is currently the most widely used sequencing technology, but sequencing errors still cannot be avoided and occur at a rate of 0.1-0.2% or higher. The DNA polymerase used in PCR also has an error rate, of 10⁻⁷ to 10⁻⁵, and the accumulated error increases with the number of PCR cycles.

To detect base mutations present at below 0.1% (low-frequency mutations) and to tell them apart from sequencing errors, researchers developed the molecular tag method, in which a specific sequence is added to one or both ends of each sequencing template before PCR. Each position of the molecular tag can be any one of the four bases A, T, C, G, and the tag length is chosen according to experimental needs, so a molecular tag of length n has 4^n possible sequences. If the molecular tags are distributed completely at random, their diversity ensures that each original template is unique once a molecular tag has been ligated in the original library. Each original template then gives rise to one "molecular cluster" during the subsequent PCR, and if there are no sequencing or PCR errors, the sequences in each cluster are error-free copies of the positive and negative strands of the original template.

Theoretically, the base at each position of the molecular tag is completely randomly distributed. In practice, however, the four bases A, T, C, G are added in equal amounts when each random position is synthesized during primer synthesis, and because the four bases differ in the energy required for synthesis or in synthesis efficiency, their frequencies at each position are not exactly equal. Runs of multiple consecutive identical bases, such as 8 A or 8 G, may appear, so the actual number of random molecular tag species falls short of the theoretical number.
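To make the concern about consecutive identical bases concrete, the following minimal sketch enumerates every fully random tag of a given length and measures what fraction contains a run of identical bases at least as long as a chosen threshold. It assumes an idealized synthesis with equal base frequencies, which the paragraph above notes is not what happens in practice; the function names `longest_run` and `fraction_with_run` are illustrative and do not come from the patent.

```python
from itertools import groupby, product

BASES = "ATCG"

def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    return max((len(list(group)) for _, group in groupby(seq)), default=0)

def fraction_with_run(length, min_run):
    """Fraction of all 4**length tags whose longest run is >= min_run.

    Assumption: every base is equally likely at every position.
    """
    total = 4 ** length
    hits = sum(
        1
        for tag in product(BASES, repeat=length)
        if longest_run(tag) >= min_run
    )
    return hits / total
```

Calling `fraction_with_run(8, 4)`, for example, walks through all 65536 random 8-mers; under biased synthesis the real fraction would be expected to differ.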

Multiple consecutive identical bases not only increase the likelihood of sequencing errors but also increase the proportion of dominant molecular sequences. When different molecules with very similar sequences are linked to the same tag sequence, the skilled person cannot tell whether a given sequence belongs to a molecule that is normally present, results from a sequencing error, or carries a low-frequency mutation. Further, when a molecule carrying a low-frequency mutation bears the same tag as a sequence of normal abundance, the low-frequency mutation is mistaken for a sequencing error or a PCR error and is missed. The non-randomness of molecular tags can therefore reduce their utility and even limit their application. To address the poor detection performance caused by multiple consecutive identical bases, some researchers add U bases to the molecular tag, for example NNNUUUNNNUNNN. This method, however, increases the length of the molecular tag, and the U bases do not help distinguish different molecules during analysis, that is, they do not function as a molecular tag. The method therefore adds invalid tag length, wastes sequencing length, and raises sequencing cost.

Disclosure of Invention

The invention aims to provide a label composition and a detection method, which can effectively control the number of bases of a label and reduce the waste of sequencing data.

The invention provides a combined label, which comprises a molecular label and a library label, wherein the base of the molecular label is arranged in a cross way with the base of the library label.

In another aspect, the invention also provides an adaptor containing the combined tag, wherein the combined tag is located at any position of the adaptor other than the overhanging T and the terminal 20 bp of the non-overhanging end.

The invention also provides a method for determining that a target region of a sample to be detected contains a low-frequency mutant nucleic acid sequence, which comprises the following steps:

S1, performing an adaptor ligation reaction on the target region nucleic acid of the sample to be tested using the adaptor, and performing PCR amplification on the adaptor-ligated target region nucleic acid to obtain an amplification product, wherein the amplification product constitutes the target region nucleic acid sequencing library of the sample to be tested;

S2, sequencing the target region nucleic acid sequencing library of the sample to be tested to obtain sequenced nucleic acid sequences;

S3, classifying the sequenced nucleic acid sequences according to the molecular tags contained in the adaptors, and assigning sequenced nucleic acid sequences carrying the same molecular tag to the same nucleic acid sequence set;

S4, comparing the sequenced nucleic acid sequences within each nucleic acid sequence set with each other, and counting the base types and their frequencies at each base position in the set;

S5, obtaining, by data analysis of the base types and frequencies at each base position in the nucleic acid sequence set, the nucleic acid sequence containing the correct base at each position;

S6, comparing the nucleic acid sequence containing the correct bases with the remaining nucleic acid sequences in the same set or with the nucleic acid sequences in parallel sets to obtain the nucleic acid sequence containing the low-frequency mutation.
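Steps S3 to S5 can be sketched as follows. This is a minimal illustration, assuming that the reads in a set are already aligned and of equal length and that the "correct" base at a position is simply the most frequent one; the patent only states that it is obtained "by data analysis", so the majority rule, the function names, and the toy reads are all assumptions.

```python
from collections import Counter, defaultdict

def cluster_by_tag(reads):
    """S3: group reads by molecular tag; `reads` holds (tag, sequence) pairs."""
    clusters = defaultdict(list)
    for tag, seq in reads:
        clusters[tag].append(seq)
    return dict(clusters)

def base_counts(seqs):
    """S4: count base types at each position (equal-length reads assumed)."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(seqs):
    """S5 (assumed rule): keep the most frequent base at each position."""
    return "".join(c.most_common(1)[0][0] for c in base_counts(seqs))

# Toy data: tag ACGT has three reads, one of them with a single error.
reads = [
    ("ACGT", "TTGACA"), ("ACGT", "TTGACA"), ("ACGT", "TTGGCA"),
    ("GGCA", "TTGTCA"), ("GGCA", "TTGTCA"),
]
clusters = cluster_by_tag(reads)
consensi = {tag: consensus(seqs) for tag, seqs in clusters.items()}
```

In this toy input the stray G in the third read is outvoted within its set, while the T carried by every read of the second set survives into that set's consensus.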

The invention combines the library tag with the random molecular tag and uses the defined base sequence of the library tag, which identifies different samples, to separate the random bases of the molecular tag. This limits the number of consecutive identical bases without reducing the diversity of the molecular tag, without adding to the length of either tag, and without wasting sequencing data.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, of which:

FIG. 1 is a flowchart of a method for determining that a target region of a sample contains a low-frequency mutated nucleic acid sequence according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of the structure of a molecular tag in a fully complementary double-stranded linker according to an embodiment of the present invention.

FIG. 3 is a schematic structural diagram of a molecular tag located at a complementary end of a complementary-end-open Y-type linker according to an embodiment of the present invention.

FIG. 4 is a schematic structural diagram of a molecular tag located at an open end in a Y-type linker with a complementary end and an open end according to an embodiment of the present invention.

FIG. 5 is a schematic diagram of a Y-shaped structure in which a molecular tag is not located on a linker but can be introduced into the linker by PCR in an embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in detail. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

It is to be noted that, in the description of the present invention, "a plurality" means two or more unless otherwise specified.

The invention provides a combined label, which comprises a molecular label and a library label, wherein bases of the library label are arranged with the molecular label in a crossed way.

The library tag is a tag sequence used to identify different sample libraries in sequencing so that multiple libraries can be sequenced together. For example, when the sequencing platform is Proton, the library tag used is the barcode; when the sequencing platform is Illumina, the library tag used is the index.

According to a specific embodiment of the invention, every 1-2 bases of the library tag are interleaved with every 1-3 bases of the molecular tag. The details are as follows:

First, every 1 base of the library tag is interleaved with every 1 base of the molecular tag, and the combined tag has at most 2 consecutive identical bases. Reference is made to the following specific examples:

1. When the combined tag is AN₂TN₄GN₆CN₈……ANₙ₋₆TNₙ₋₄GNₙ₋₂CNₙ, from left to right, positions 1, 3, 5, 7, 9, ..., n-3, n-1 are the library tag (ATGC…ATGC), and positions 2, 4, 6, 8, 10, ..., n-2, n are the molecular tag (N₂N₄N₆N₈…Nₙ₋₆Nₙ₋₄Nₙ₋₂Nₙ).

Each base of the molecular tag is different from the library tag base immediately preceding it. For example, in AN₂TN₄GN₆CN₈……, N₂ is not A and may be any one of T, C, G; N₄ is not T and may be any one of A, C, G.

For 1 defined library tag, the number of molecular tag combinations is 3^(n/2). For example, when n = 16, the library tag is 8 bp long, the molecular tag is 8 bp long, and the number of molecular tag sequence combinations is 3⁸ = 6561.
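The count of 3^(n/2) and the limit of 2 consecutive identical bases can be checked by brute force for a short tag. The sketch below assumes the single-base arrangement of example 1 (library base first, each molecular base differing from the library base before it); the helper names and the choice of ATGCATGC as library tag are illustrative only.

```python
from itertools import product

BASES = "ATCG"

def interleave_single(library_tag, molecular_tag):
    """Build L1 N2 L3 N4 ...: one library base, then one molecular base."""
    if len(library_tag) != len(molecular_tag):
        raise ValueError("tags must have equal length for this arrangement")
    return "".join(l + m for l, m in zip(library_tag, molecular_tag))

def allowed_molecular_tags(library_tag):
    """Yield molecular tags whose bases differ from the preceding library base."""
    choices = [[b for b in BASES if b != l] for l in library_tag]
    for combo in product(*choices):
        yield "".join(combo)

def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best

library = "ATGCATGC"  # 8 bp library tag, so the combined tag has n = 16
tags = [interleave_single(library, m) for m in allowed_molecular_tags(library)]
n_tags = len(tags)                             # equals 3 ** 8
worst_run = max(longest_run(t) for t in tags)  # longest homopolymer over all tags
```

Because a molecular base can never equal the library base before it, a run can only span one molecular base and the library base after it, which is why the longest run over all 6561 tags is 2.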

2. When the combined tag is N₁AN₃TN₅GN₇……CNₙ₋₇ANₙ₋₅TNₙ₋₃GNₙ₋₁C, from left to right, positions 2, 4, 6, 8, 10, ..., n are the library tag, and positions 1, 3, 5, 7, 9, ..., n-3, n-1 are the molecular tag.

Each base of the molecular tag is different from the library tag base immediately following it. For example, in N₁AN₃TN₅GN₇……, N₁ is not A and may be any one of T, C, G; N₃ is not T and may be any one of A, C, G.

For 1 defined library tag, the number of molecular tag combinations is 3^(n/2). For example, when n = 16, the library tag is 8 bp long, the molecular tag is 8 bp long, and the number of molecular tag sequence combinations is 3⁸ = 6561.

3. When the combined tag is AN₂TN₄GN₆CN₈……ANₙ₋₇TNₙ₋₅GNₙ₋₃CNₙ₋₁A, from left to right, positions 1, 3, 5, 7, 9, ..., n-2, n are the library tag, and positions 2, 4, 6, 8, 10, ..., n-1 are the molecular tag.

For 1 defined library tag, the number of molecular tag combinations is 3^((n-1)/2). For example, when n = 17, the library tag is 9 bp long, the molecular tag is 8 bp long, and the number of molecular tag sequence combinations is 3⁸ = 6561.

4. When the combined tag is N₁AN₃TN₅GN₇……CNₙ₋₈ANₙ₋₆TNₙ₋₄GNₙ₋₂CNₙ, from left to right, positions 2, 4, 6, 8, 10, ..., n-1 are the library tag, and positions 1, 3, 5, 7, 9, ..., n-2, n are the molecular tag.

For 1 defined library tag, the number of molecular tag combinations is 3^((n+1)/2). For example, when n = 17, the library tag is 8 bp long, the molecular tag is 9 bp long, and the number of molecular tag sequence combinations is 3⁹ = 19683.

Second, every 1-2 bases of the library tag are interleaved with every 1-2 bases of the molecular tag, and the combined tag has at most 3 consecutive identical bases.

Further, every 1-2 bases of the library tag are interleaved with every 1 base of the molecular tag, and the combined tag has at most 3 consecutive identical bases. Reference is made to the following specific examples:

5. When the combined tag is ATN₃GCN₆……ACNₙ₋₃TCNₙ, from left to right, positions 1, 2, 4, 5, 7, 8, ..., n-2, n-1 are the library tag, and positions 3, 6, 9, 12, 15, 18, ..., n-3, n are the molecular tag.

Each base of the molecular tag is different from any library tag base adjacent to it.

For 1 defined library tag, the number of molecular tag combinations is 4^(n/3). When n = 18, the library tag is 12 bp long, the molecular tag is 6 bp long, and the number of molecular tag sequence combinations is 4⁶ = 4096.

6. When the combined tag is N₁ATN₄GC……Nₙ₋₆ACNₙ₋₃TGNₙ, from left to right, positions 2, 3, 5, 6, 8, 9, ..., n-2, n-1 are the library tag, and positions 1, 4, 7, 10, 13, 16, 19, ..., n-6, n-3, n are the molecular tag.

For 1 defined library tag, the number of molecular tag combinations is 4^((n+2)/3). When n = 19, the library tag is 12 bp long, the molecular tag is 7 bp long, and the number of molecular tag sequence combinations is 4⁷ = 16384.

7. When the combined tag is ATN₃GCN₆……ACNₙ₋₄TGNₙ₋₁C, from left to right, positions 1, 2, 4, 5, 7, 8, ..., n-2, n are the library tag, and positions 3, 6, 9, 12, 15, 18, ..., n-4, n-1 are the molecular tag.

For 1 defined library tag, the number of molecular tag combinations is 4^((n-1)/3). When n = 19, the library tag is 13 bp long, the molecular tag is 6 bp long, and the number of molecular tag sequence combinations is 4⁶ = 4096.

8. When the combined tag is TN₂GCN₅ACN₈……TGNₙ₋₂CT, from left to right, positions 1, 3, 4, 6, 7, ..., n-4, n-3, n-1, n are the library tag, and positions 2, 5, 8, 11, 14, ..., n-2 are the molecular tag.

For 1 defined library tag, the number of molecular tag combinations is 4^((n-1)/3). When n = 13, the library tag is 9 bp long, the molecular tag is 4 bp long, and the number of molecular tag sequence combinations is 4⁴ = 256.

Further, every 1 base of the library tag is interleaved with every 1-2 bases of the molecular tag, and the combined tag has at most 3 consecutive identical bases. Reference is made to the following specific examples:

9. When the combined tag is AN₂N₃TN₅N₆……CNₙ₋₄Nₙ₋₃GNₙ₋₁Nₙ, from left to right, positions 1, 4, 7, ..., n-5, n-2 are the library tag, and positions 2, 3, 5, 6, ..., n-4, n-3, n-1, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(2n/3). When n = 24, the library tag is 8 bp long, the molecular tag is 16 bp long, and the number of molecular tag sequence combinations is 4¹⁶ = 4294967296.

10. When the combined tag is AN₂N₃TN₅N₆……CNₙ₋₅Nₙ₋₄GNₙ₋₂Nₙ₋₁T, from left to right, positions 1, 4, 7, ..., n-6, n-3, n are the library tag, and positions 2, 3, 5, 6, ..., n-5, n-4, n-2, n-1 are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(2(n-1)/3). When n = 25, the library tag is 9 bp long, the molecular tag is 16 bp long, and the number of molecular tag sequence combinations is 4¹⁶ = 4294967296.

11. When the combined tag is N₁N₂TN₄N₅A……CNₙ₋₅Nₙ₋₄GNₙ₋₂Nₙ₋₁T, from left to right, positions 3, 6, 9, ..., n-6, n-3, n are the library tag, and positions 1, 2, 4, 5, 7, ..., n-5, n-4, n-2, n-1 are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(2n/3). When n = 24, the library tag is 8 bp long, the molecular tag is 16 bp long, and the number of molecular tag sequence combinations is 4¹⁶ = 4294967296.

12. When the combined tag is N₁N₂TN₄N₅A……CNₙ₋₄Nₙ₋₃GNₙ₋₁Nₙ, from left to right, positions 3, 6, 9, ..., n-5, n-2 are the library tag, and positions 1, 2, 4, 5, 7, ..., n-4, n-3, n-1, n are the molecular tag.

The base of the molecular tag may be any one of the four bases. For example, in N₁N₂TN₄N₅A……, each N may be any one of A, T, C, G.

For 1 defined library tag, the number of molecular tag combinations is 4^(2(n+1)/3). When n = 26, the library tag is 8 bp long, the molecular tag is 18 bp long, and the number of molecular tag sequence combinations is 4¹⁸ = 68719476736.

13. When the combined tag is AN₂TN₄N₅GN₇CN₉N₁₀……GNₙ₋₃CNₙ₋₁Nₙ, from left to right, positions 1, 3, 6, 8, ..., n-4, n-2 are the library tag, and positions 2, 4, 5, 7, 9, ..., n-3, n-1, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(4n/7). When n = 21, the library tag is 9 bp long, the molecular tag is 12 bp long, and the number of molecular tag sequence combinations is 4¹² = 16777216.

14. When the combined tag is AN₂N₃TN₅GN₇N₈CN₁₀……GNₙ₋₃Nₙ₋₂CNₙ, from left to right, positions 1, 4, 6, 9, ..., n-4, n-1 are the library tag, and positions 2, 3, 5, 7, 8, ..., n-3, n-2, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

15. When the combined tag is AN₂N₃TN₅GN₇N₈CN₁₀……GNₙ₋₄Nₙ₋₃CNₙ₋₁T, from left to right, positions 1, 4, 6, 9, ..., n-5, n-2, n are the library tag, and positions 2, 3, 5, 7, 8, ..., n-4, n-3, n-1 are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(4(n-1)/7). When n = 22, the library tag is 10 bp long, the molecular tag is 12 bp long, and the number of molecular tag sequence combinations is 4¹² = 16777216.

Further, every 1-2 bases of the library tag are interleaved with every 1-2 bases of the molecular tag, and the combined tag has at most 3 consecutive identical bases. Reference is made to the following specific examples:

16. When the combined tag is AN₂N₃TGN₆CN₈N₉ATN₁₂……GNₙ₋₄Nₙ₋₃CANₙ, from left to right, positions 1, 4, 5, 7, 10, 11, ..., n-5, n-2, n-1 are the library tag, and positions 2, 3, 6, 8, 9, 12, ..., n-4, n-3, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(n/2). When n = 16, the library tag is 8 bp long, the molecular tag is 8 bp long, and the number of molecular tag sequence combinations is 4⁸ = 65536.

17. When the combined tag is ATN₃N₄GN₆CTN₉N₁₀AN₁₂……GCNₙ₋₃Nₙ₋₂ANₙ, from left to right, positions 1, 2, 5, 7, 8, 11, ..., n-5, n-4, n-1 are the library tag, and positions 3, 4, 6, 9, 10, 12, ..., n-3, n-2, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(n/2). When n = 16, the library tag is 8 bp long, the molecular tag is 8 bp long, and the number of molecular tag sequence combinations is 4⁸ = 65536.

Third, every 1-2 bases of the library tag are interleaved with every 2-3 bases of the molecular tag, and the combined tag has at most 4 consecutive identical bases. Reference is made to the following specific examples:

18. When the combined tag is AN₂N₃N₄TGN₇N₈CN₁₀N₁₁N₁₂AT……ANₙ₋₆Nₙ₋₅Nₙ₋₄TGNₙ₋₁Nₙ, from left to right, positions 1, 5, 6, 9, 13, 14, ..., n-7, n-3, n-2 are the library tag, and positions 2, 3, 4, 7, 8, 10, 11, 12, ..., n-6, n-5, n-4, n-1, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(5n/8). When n = 24, the library tag is 9 bp long, the molecular tag is 15 bp long, and the number of molecular tag sequence combinations is 4¹⁵ = 1073741824.

19. When the combined tag is ATN₃N₄N₅GCN₈N₉N₁₀ATN₁₃N₁₄N₁₅……GCNₙ₋₇Nₙ₋₆Nₙ₋₅ATNₙ₋₂Nₙ₋₁Nₙ, from left to right, positions 1, 2, 6, 7, 11, 12, ..., n-9, n-8, n-4, n-3 are the library tag, and positions 3, 4, 5, 8, 9, 10, 13, 14, 15, ..., n-7, n-6, n-5, n-2, n-1, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(3n/5). When n = 20, the library tag is 8 bp long, the molecular tag is 12 bp long, and the number of molecular tag sequence combinations is 4¹² = 16777216.

Fourth, every 1-2 bases of the library tag are interleaved with every 1-3 bases of the molecular tag, and the combined tag has at most 4 consecutive identical bases. Reference is made to the following specific examples:

20. When the combined tag is AN₂N₃N₄TGN₇N₈CN₁₀……ANₙ₋₈Nₙ₋₇Nₙ₋₆TGNₙ₋₃Nₙ₋₂CNₙ, from left to right, positions 1, 5, 6, 9, ..., n-9, n-5, n-4, n-1 are the library tag, and positions 2, 3, 4, 7, 8, 10, ..., n-8, n-7, n-6, n-3, n-2, n are the molecular tag.

The base of the molecular tag may be any one of the four bases.

For 1 defined library tag, the number of molecular tag combinations is 4^(6n/10). When n = 20, the library tag is 8 bp long, the molecular tag is 12 bp long, and the number of molecular tag sequence combinations is 4¹² = 16777216.
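For the arrangements in which every molecular base may be any of the four bases, the counts quoted in the examples follow directly from the number of molecular positions in the layout. The sketch below uses a hypothetical mask notation that is not in the patent, with `L` for a library base and `N` for a molecular base, to recompute the tag lengths and combination counts of examples 9, 19 and 20.

```python
def tile_mask(unit, n):
    """Repeat a layout unit ('L' = library, 'N' = molecular) up to length n."""
    reps = -(-n // len(unit))  # ceiling division
    return (unit * reps)[:n]

def tag_lengths(mask):
    """Return (library tag length, molecular tag length) for a layout mask."""
    return mask.count("L"), mask.count("N")

def molecular_combinations(mask, choices_per_base=4):
    """Number of molecular tags when each N position has the given choices."""
    return choices_per_base ** mask.count("N")

# Example 9:  A NN T NN ...         -> unit LNN,        n = 24
# Example 19: AT NNN GC NNN ...     -> unit LLNNN,      n = 20
# Example 20: A NNN TG NN C N ...   -> unit LNNNLLNNLN, n = 20
mask9 = tile_mask("LNN", 24)
mask19 = tile_mask("LLNNN", 20)
mask20 = tile_mask("LNNNLLNNLN", 20)
```

Passing `choices_per_base=3` would reproduce the counts of examples 1 to 4, where each molecular base is barred from matching one neighbouring library base.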

21. When the combined tag is ATN₃N₄N₅GN₇ATN₁₀N₁₁N₁₂GN₁₄……ATNₙ₋₄Nₙ₋₃Nₙ₋₂GNₙ, from left to right, positions 1, 2, 6, 8, 9, 13, ..., n-6, n-5, n-1 are the library tag, and positions 3, 4, 5, 7, 10, 11, 12, 14, ..., n-4, n-3, n-2, n are the molecular tag.

The base of the molecular tag may be any one of four bases.

The invention addresses a shortcoming of the prior art, in which U bases are added to the molecular tag to separate it (NNNUUUNNUUUNNNN) in order to avoid multiple consecutive identical bases. The library tag and the random molecular tag are combined here for the first time, so that, without introducing any invalid length, the effective molecular tag length can be increased until both the library tag and the molecular tag are long enough for a specific scheme.

According to the specific embodiment of the invention, the length of the molecular tag is 6-18 bp, and the length of the library tag is 8-12 bp.

The invention also provides an adaptor containing the combined tag, wherein the combined tag is located at any position of the adaptor other than the overhanging T and the terminal 20 bp of the non-overhanging end.

According to a specific embodiment of the invention, the adaptor further comprises a discriminating signature sequence of 4 non-repeating bases, said discriminating signature sequence being linked to the 3 'end or the 5' end of the combined tag.

The invention also provides a method for determining that a target region of a sample to be detected contains a low-frequency mutant nucleic acid sequence, which comprises the following steps as shown in figure 1:

The scheme of the invention will be explained with reference to the examples. It will be appreciated by persons skilled in the art that the following examples are illustrative only and are not to be construed as limiting the invention. Reagents, sequences (adaptors, tags and primers), software and equipment not specifically described in the following examples are conventional commercial products or open-source resources, unless otherwise stated.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Example 1 method for determining Low-frequency mutant nucleic acid sequence in target region of sample to be tested

1. Designing a combined label and a joint containing the combined label.

The combinatorial tag is designed according to the way that the library tag and the molecular tag are arranged in a single base crossing mode, and the combinatorial tag contains at most 2 continuous identical bases. A group of 16 combined labels is designed according to the experimental requirements. As shown in table 1, 16 combination tags:

TABLE 1

Wherein underlined bases are molecular tag sequences and non-underlined bases are library tag sequences.

The combined tags designed above are built into a set of adaptors, in which the combined tag can be located anywhere on the adaptor other than the overhanging "T" and the terminal 20 bp of the non-overhanging end. NNN...NNN represents a combined tag. The adaptor may be a fully complementary double-stranded structure, a Y-type structure with one complementary end and one open end, or a Y-type structure in which the combined tag is introduced into the adaptor by PCR, as shown in FIGS. 2, 3, 4, and 5. The combined tag may be located only at either end or in the middle of the adaptor, or may be distributed over 2 or more positions. The number of N represents the number of bases of the combined tag, and this number can be increased when more combined tag species are needed, for example to 8 bp, 12 bp, 16 bp, 24 bp or more.

As shown in table 2, 16 linkers containing different combination tags:

TABLE 2

When the linker is as shown in FIG. 1 and FIG. 2 and the like, it is necessary to design the structure containing the reverse complement of the combinatorial tag at the same time, for example, it is necessary to design the F-directional sequence and the R-directional sequence in Table 2 at the same time, and FIGS. 3 and 4 and the like only need to design the single-stranded combinatorial tag, for example, the F-directional sequence in Table 2, and it is not necessary to design the reverse complement of the combinatorial tag.
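Deriving the reverse-complement strand for a double-stranded adaptor design is mechanical. The following is a minimal sketch; the sequences of Table 2 are not reproduced in this text, so the example tag below is a hypothetical placeholder, and `N` is mapped to `N` so that degenerate positions are preserved.

```python
# Complement table for the four bases plus the degenerate symbol N.
_COMPLEMENT = str.maketrans("ATCGN", "TAGCN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/T/C/G/N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

forward_tag = "ACTGGTCA"  # hypothetical combined tag, not taken from Table 2
reverse_tag = reverse_complement(forward_tag)
```

Applying the function twice returns the original sequence, which is a quick sanity check on a designed F/R pair.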

Depending on the needs of the experiment, identifying signature sequences and/or library tags may also be added at the 3 'or 5' end of the combinatorial tags. For example, when sequencing using the Ion Torrent platform, Barcode sequences that identify different samples can be added to it.

2. Synthesis of linkers containing combinatorial tags

According to the designed adaptor sequence, the designed combined tag or its reverse complement is synthesized together with the sequences at its 3' and 5' ends to obtain the adaptor containing the combined tag. As will be understood by those skilled in the art, the synthesis may use any method known in the art or may be outsourced to a primer synthesis company.

3. Diluting the obtained joint into working solution for later use.

4. Extraction of sample DNA

10 ml of the patient's peripheral blood was drawn with EDTA anticoagulant, the plasma was separated by centrifugation while fresh, and the plasma DNA was extracted according to methods well known to those skilled in the art.

5. DNA end repair

The extracted DNA solution and the mixed solution of the end-repairing reagent are mixed, and the mixture is reacted according to an end-repairing method well known to those skilled in the art, and then separated and purified after the reaction is finished.

5.1 The following reaction system was prepared in a 1.5 ml EP tube:

Reagent	Volume/ul
DNA	50
10× PNK buffer	5
dNTP solution (10 mM)	2
T4 DNA polymerase	1
T4 PNK	1
Klenow fragment (10-fold dilution)	1
Total volume/ul	50

Mix well at room temperature, centrifuge briefly, place the reaction system in a PCR instrument, and react at 20 ℃ for 30 minutes. After the reaction is finished, purify with AMPure XP magnetic beads.

5.2 Add 90 ul of AMPure XP magnetic beads to the 50 ul reaction product. After bead purification, wash twice with 500 ul of 75% ethanol and discard the supernatant. Dry at 37 ℃ until the beads are dry. Add 23 ul of water, mix the beads well, and aspirate 22 ul of supernatant once the solution has cleared.

6. Coupling reaction

The end-repaired DNA solution is mixed with the working solution of the adaptor containing the combined tag obtained in step 3 and with the ligation reagent mixture, reacted according to an adaptor ligation method well known to those skilled in the art, and separated and purified after the reaction is finished.

6.1 preparing a reaction solution from the solution obtained in the step 5 according to the following system:

6.2 Magnetic bead purification was carried out as in 5.2, except that 75 ul of magnetic beads were added to the 50 ul reaction product. The beads were washed twice with 500 ul of 75% ethanol and the supernatant was discarded. Dry at 37 ℃ until the beads are dry. Add 36 ul of water, mix the beads well, and aspirate 34.5 ul of supernatant once the solution has cleared.

7. PCR enrichment and sequencing library construction

Mixing the DNA added with the joint and the mixed solution of the PCR reaction reagent uniformly, carrying out PCR reaction according to a method well known by a person skilled in the art, carrying out separation and purification after the reaction is finished, carrying out QC detection on the library after the library is constructed, and waiting for sequencing after the library is qualified.

7.1 Prepare the reaction solution in one new PCR tube according to the following system:

Reagent	Volume/ul
DNA	34.5
10× Pfx Amplification Buffer	5
dNTP solution (10 mM)	5
MgSO₄ (50 mM)	2
PCR primer PE1 (10 pmol/ul)	4
PCR primer PE2 (10 pmol/ul)	4
Pfx DNA polymerase	1
Total volume/ul	50

Mix well at room temperature, centrifuge briefly, place the reaction in a PCR instrument, and run under the following conditions:

After the reaction is complete, purify with AMPure XP magnetic beads.

7.2 Purify with magnetic beads as described in 5.2, except that 50 ul of magnetic beads are added to the 50 ul reaction product. Library construction is now complete.
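The three purification steps (5.2, 6.2 and 7.2) differ mainly in the volume of magnetic beads added to the same 50 ul of reaction product. As a small worked calculation, using only the volumes stated in those steps, the bead-to-sample ratios can be computed as follows. The function name `bead_ratio` and the dictionary layout are illustrative assumptions, not part of the described method.

```python
# Bead and reaction-product volumes (ul) as stated in steps 5.2, 6.2 and 7.2.
purifications = {
    "5.2": (90, 50),
    "6.2": (75, 50),
    "7.2": (50, 50),
}


def bead_ratio(bead_ul, sample_ul):
    """Return the bead volume divided by the sample volume."""
    return bead_ul / sample_ul


# Ratio per purification step, rounded to avoid floating-point noise.
ratios = {step: round(bead_ratio(b, s), 2) for step, (b, s) in purifications.items()}

for step, ratio in ratios.items():
    print(f"step {step}: {ratio}x beads relative to sample")
```

The ratio decreases from step 5.2 to step 7.2, while the wash and drying conditions stay the same.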

8. Library quality inspection

The library is checked by qPCR and on an Agilent 2100. Libraries that pass quality inspection are scheduled for sequencing.

9. DNA sequencing of the library

The library can be sequenced on a second-generation sequencer such as the Ion Torrent Proton or the Ion Torrent PGM.

10. Analysis of sequencing results

The DNA sequencing results are analyzed as follows. The reads are classified by combined tag, and all reads carrying the same combined tag are treated as one 'molecular cluster'. A molecular cluster is the single family of DNA produced by PCR from one initial DNA molecule, that is, the 'copied strands' of the positive and negative strands of the original DNA molecule.

Within each molecular cluster, the base types observed at each base position and the frequency of each base type are counted.

Based on this analysis, errors introduced by PCR and sequencing are identified and corrected.

This yields the correct sequence of the original DNA, and the true mutant sequences are found by comparison within each molecular cluster and in parallel across clusters.
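The analysis steps above can be sketched in code. The following is a minimal illustration, not the analysis pipeline of the invention: it assumes the combined tag has already been extracted from each read, that reads within a cluster are aligned and of comparable length, and that a position is accepted only when one base reaches a chosen frequency threshold. The function names and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_tag(reads):
    """Group (combined_tag, sequence) pairs into molecular clusters."""
    clusters = defaultdict(list)
    for tag, seq in reads:
        clusters[tag].append(seq)
    return dict(clusters)


def position_counts(seqs):
    """Count the base types seen at each position of a cluster."""
    length = min(len(s) for s in seqs)  # assumption: reads are aligned
    return [Counter(s[i] for s in seqs) for i in range(length)]


def consensus(seqs, min_fraction=0.8):
    """Call the majority base per position, or 'N' below the threshold.

    min_fraction is an illustrative cutoff, not a value from the source.
    """
    called = []
    for counts in position_counts(seqs):
        base, n = counts.most_common(1)[0]
        called.append(base if n / len(seqs) >= min_fraction else "N")
    return "".join(called)


# Toy reads: tags and sequences are made up for illustration.
reads = [
    ("TAG1", "ACGT"), ("TAG1", "ACGT"), ("TAG1", "ACTT"),
    ("TAG2", "ACCT"), ("TAG2", "ACCT"),
]
clusters = group_by_tag(reads)
corrected = {tag: consensus(seqs, min_fraction=0.6) for tag, seqs in clusters.items()}
print(corrected)
```

In this toy input, the single discordant base in the first cluster is outvoted by the other two reads, which is the sense in which a PCR or sequencing error is corrected within a cluster. A difference that persists between clusters is a candidate true mutation.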

Example 2

The method for determining a nucleic acid sequence containing a low-frequency mutation in the target region of the sample to be tested is essentially the same as in Example 1, except that in step 1 every 2 bases of the library tag are arranged in a crossed (alternating) manner with every 1 base of the molecular tag.

As shown in table 3 below:

Adaptor P1 sequence (5'-3'):

SEQ ID NO: 46: CCTCTCTATGGGCAGTCGGTGAT.
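The crossed arrangement of Example 2 (2 library tag bases followed by 1 molecular tag base) can be illustrated with a short sketch. The tag sequences below are invented, `N` stands for a random molecular tag base, and starting with the library tag is an assumption. The function `interleave` is a hypothetical helper, not something defined in the source.

```python
def interleave(library_tag, molecular_tag, lib_step, mol_step):
    """Alternate lib_step library bases with mol_step molecular bases.

    Assumption: the combined tag starts with library tag bases.
    """
    parts = []
    i = j = 0
    while i < len(library_tag) or j < len(molecular_tag):
        parts.append(library_tag[i:i + lib_step])
        i += lib_step
        parts.append(molecular_tag[j:j + mol_step])
        j += mol_step
    return "".join(parts)


# Hypothetical 12 bp library tag and 6 bp random molecular tag.
library_tag = "ACGTCAGTGCAT"
molecular_tag = "NNNNNN"

combined = interleave(library_tag, molecular_tag, lib_step=2, mol_step=1)
print(combined)
```

The combined tag has the same total length as the two tags together (18 bases here), which matches the stated aim of not lengthening either tag.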

Example 3

The method for determining a nucleic acid sequence containing a low-frequency mutation in the target region of the sample to be tested is essentially the same as in Example 1, except that in step 1 every 1-2 bases of the library tag are arranged in a crossed (alternating) manner with every 1-2 bases of the molecular tag.

As shown in table 4 below:

Adaptor P1 sequence (5'-3'):

SEQ ID NO: 59: CCTCTCTATGGGCAGTCGGTGAT.

Example 4

The method for determining a nucleic acid sequence containing a low-frequency mutation in the target region of the sample to be tested is essentially the same as in Example 1, except that in step 1 every 1-2 bases of the library tag are arranged in a crossed (alternating) manner with every 2-3 bases of the molecular tag.

As shown in table 5 below:

Adaptor P1 sequence (5'-3'):

SEQ ID NO: 72: CCTCTCTATGGGCAGTCGGTGAT.

The embodiments described above merely illustrate preferred embodiments of the present invention and do not limit its scope. Various modifications and improvements made to the technical solution of the present invention by those skilled in the art, without departing from the spirit of the present invention, shall fall within the protection scope defined by the claims of the present invention.

Claims

1. A method for determining a nucleic acid sequence containing a low-frequency mutation in a target region of a sample to be tested, comprising the following steps:

S1: performing an adaptor ligation reaction on the target-region nucleic acid of the sample to be tested using an adaptor, wherein the adaptor contains a combined tag, the combined tag comprises a molecular tag and a library tag, the bases of the molecular tag and the bases of the library tag are arranged in a crossed manner, and the combined tag is located at any position of the adaptor other than the overhanging terminal T and the terminal 20 bp of the non-overhanging end; and performing PCR amplification on the adaptor-ligated target-region nucleic acid of the sample to be tested to obtain an amplification product, wherein the amplification product constitutes a target-region nucleic acid sequencing library of the sample to be tested;

2. The method of claim 1, wherein the adaptor further comprises an identifying signature sequence of 4 non-repeating bases, wherein the identifying signature sequence is linked to the 3 'end or the 5' end of the combined tag.

3. The method of claim 1, wherein every 1-2 bases of the library tag are crossed with every 1-3 bases of the molecular tag.

4. The method of claim 3, wherein every 1 base of the library tag is crossed with every 1 base of the molecular tag, and the combined tag has at most 2 consecutive identical bases.

5. The method of claim 3, wherein every 1-2 bases of the library tag are crossed with every 1-2 bases of the molecular tag, and the combined tag has at most 3 consecutive identical bases.

6. The method of claim 3, wherein every 1-2 bases of the library tag are crossed with every 2-3 bases of the molecular tag, and the combined tag has at most 4 consecutive identical bases.

7. The method of claim 3, wherein every 1-2 bases of the library tag are crossed with every 1-3 bases of the molecular tag, and the combined tag has at most 4 consecutive identical bases.

8. The method of claim 1, wherein the molecular tag has a length of 6-18 bp and the library tag has a length of 8-12 bp.