WO2023092601A1 - UMI molecular tag and application thereof, adapter, adapter ligation reagent, kit, and library construction method - Google Patents

Info

Publication number
WO2023092601A1
Authority
WO
WIPO (PCT)
Prior art keywords
bases
umi
fixed
random
strand
Prior art date
Application number
PCT/CN2021/134159
Other languages
English (en)
French (fr)
Inventor
叶邦全
Original Assignee
京东方科技集团股份有限公司
成都京东方光电科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 成都京东方光电科技有限公司
Priority to CN202180003697.6A (published as CN116529430A)
Priority to PCT/CN2021/134159 (published as WO2023092601A1)
Priority to US17/912,373 (published as US20240209349A1)
Publication of WO2023092601A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N: MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00: Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09: Recombinant DNA-technology
    • C12N15/10: Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034: Isolating an individual clone by screening libraries
    • C12N15/1065: Preparation or screening of tagged libraries, e.g. tagged microorganisms by STM-mutagenesis, tagged polynucleotides, gene tags
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00: Libraries per se, e.g. arrays, mixtures
    • C40B40/04: Libraries containing only organic compounds
    • C40B40/06: Libraries containing nucleotides or polynucleotides, or derivatives thereof

Definitions

  • The disclosure relates to the field of biotechnology, in particular to a UMI molecular tag and its application, an adapter, an adapter ligation reagent, a kit and a library construction method.
  • NGS: Next Generation Sequencing.
  • a UMI molecular tag comprising: at least one random base and at least one fixed base.
  • At least one of the random bases and the fixed bases is plural; the plurality of random bases and/or the plurality of fixed bases are arranged contiguously; or, at least two of the plurality of random bases are spaced apart, and/or at least two of the plurality of fixed bases are spaced apart.
  • There are at least three random bases; the at least three random bases are spaced apart from one another, every two adjacent spaced random bases are separated by one group of fixed bases, and every two groups of fixed bases contain the same number of fixed bases.
  • At least one fixed base in one group of fixed bases is different from one fixed base in the other group of fixed bases.
  • the number of random bases is 3.
  • the UMI molecular tag includes 7-11 bases.
  • A molecular tag set including two UMI molecular tags that are bound through at least partial complementary base pairing, wherein at least one of the UMI molecular tags is the UMI molecular tag described above.
  • An adapter comprising: a first strand and a second strand; and at least one UMI molecular tag, each UMI molecular tag being located on the first strand or the second strand, and the at least one UMI molecular tag being the UMI molecular tag described above.
  • The two UMI molecular tags are respectively located on the first strand and the second strand, and are bound through at least partial complementary base pairing.
  • the first strand is a forward strand
  • the second strand is a reverse strand
  • the first strand includes a first sequencing primer sequence
  • the second strand includes a second sequencing primer sequence
  • the UMI molecular tag on the first strand is located downstream of the first sequencing primer sequence
  • the UMI molecular tag on the second strand is located upstream of the second sequencing primer sequence.
  • The multiple types of adapters are the adapters described above; among the multiple types of adapters, every two types of adapters differ in at least one random base of at least one UMI molecular tag they contain.
  • A kit comprising: the adapter ligation reagent as described above.
  • the genes include DNA molecules for expression of genetic information; the UMI molecular tags are configured to mark different DNA molecules.
  • a DNA library construction method comprising:
  • a gene sequencing detection method comprising: using the DNA library obtained by the DNA library construction method as described above to perform gene sequencing on the DNA.
  • A kit comprising: the DNA library obtained by the DNA library construction method described above.
  • Figure 1 is a structural diagram of a Y-shaped adapter according to some embodiments.
  • Figure 2 is a flowchart of a sequencing method according to some embodiments.
  • Figure 3 is a structural diagram of another Y-shaped adapter according to some embodiments.
  • Figure 4 is a structural diagram of a UMI molecular tag set according to some embodiments.
  • Figure 5 is a flowchart of a method for preparing an adapter according to some embodiments.
  • Figure 6 is a capillary electrophoresis peak diagram for detecting the synthesis efficiency of the double-stranded adapters of Example 1, Example 2 and Example 3 according to some embodiments.
  • The terms "first" and "second" are used for descriptive purposes only, and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features. In the description of the embodiments of the present disclosure, unless otherwise specified, "plurality" means two or more.
  • "At least one of A, B and C" has the same meaning as "at least one of A, B or C", and both include the following combinations of A, B and C: A only, B only, C only, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B and C.
  • "A and/or B" includes the following three combinations: A only, B only, and a combination of A and B.
  • DNA is an abbreviation for deoxyribonucleic acid.
  • DNA is the carrier of genetic information in biological cells, and its main function in the body is to guide the synthesis of RNA and protein.
  • DNA is a macromolecular polymer composed of deoxynucleotides, which are composed of phosphoric acid, deoxyribose and bases; among them, there are four main types of bases, namely A (adenine), G (guanine), C (cytosine) and T (thymine).
  • RNA is an abbreviation for ribonucleic acid.
  • RNA is a genetic information carrier that exists in biological cells and some viruses and viroids. Its role in the body is mainly to guide the synthesis of proteins.
  • RNA is a macromolecular polymer composed of ribonucleotides. Ribonucleotides are composed of phosphoric acid, ribose and bases; among them, there are mainly four types of bases, namely A (adenine), G (guanine), C (cytosine) and U (uracil).
  • next-generation sequencing technology is widely used in the fields of reproductive genetics and tumor detection, especially in liquid biopsy.
  • PCR: polymerase chain reaction.
  • The error rate of the bases read by the sequencer is 0.01% to 0.1% (that is, there will be 1 to 10 wrong bases for every 10,000 bases).
  • Noise mutations are also known as exogenous mutations.
  • UMI molecular tags (Unique Molecular Identifier) are introduced into the original DNA fragments.
  • UMI (Unique Molecular Identifier) molecular tags are also called molecular barcodes. The principle is to add a unique tag sequence to each original DNA fragment, which is then sequenced together with the fragment after library construction and PCR amplification. In this way, the different tag sequences make it possible to distinguish DNA templates from different sources (subsequently referred to as DNA molecules), and to distinguish false positive mutations caused by random errors in PCR amplification and sequencing from mutations actually carried by the patient, thereby improving detection sensitivity and specificity.
  • The UMI molecular tag marks the original DNA fragments: fragments originating from different DNA molecules carry different molecular tags, so reads of the same insert fragment (that is, the same original DNA fragment) can be picked out when the sequencing results are analyzed.
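The grouping principle described above can be illustrated with a minimal sketch. It assumes reads are already available as (UMI, sequence) pairs of equal length and uses a simple per-position majority vote; the function name `consensus_by_umi` and the toy reads are hypothetical and are not taken from the disclosure.

```python
from collections import Counter, defaultdict


def consensus_by_umi(reads):
    """Group (umi, sequence) reads by UMI and take a per-position majority vote."""
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)

    consensus = {}
    for umi, seqs in groups.items():
        # zip(*seqs) walks the grouped reads column by column
        # (reads within a group are assumed to have equal length).
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("AAGCT", "ACGTA"),
    ("AAGCT", "ACGTA"),
    ("AAGCT", "ACGTC"),  # one copy with an amplification or sequencing error
    ("CTGAG", "ACGGA"),  # a read from a different original molecule
]
result = consensus_by_umi(reads)
```

Reads sharing a UMI are treated as copies of one original fragment, so an error present in only a minority of the copies does not survive the vote.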
  • Both ends of an insert fragment carry complementarily paired UMI adapters; that is, UMI adapters can be used to mark the forward strand and the reverse strand of the same insert fragment. If a mutated base at the same position appears in both the forward and reverse strands, it is marked as a real mutation, and the original mutation state is restored in this way.
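The both-strands rule can be sketched as follows. This is only an illustration under stated assumptions: the forward-strand and reverse-strand consensus sequences are assumed to be already expressed in the same (forward) coordinates, and the name `is_real_mutation` is hypothetical.

```python
def is_real_mutation(reference, forward_consensus, reverse_consensus, position):
    """Return True only if both strands show the same non-reference base."""
    fwd_base = forward_consensus[position]
    rev_base = reverse_consensus[position]
    return fwd_base != reference[position] and fwd_base == rev_base


reference = "ACGTA"
# Mutation G->C at position 2 seen on both strands: treated as real.
both_strands = is_real_mutation(reference, "ACCTA", "ACCTA", 2)
# Mutation seen on the forward strand only: treated as noise.
one_strand = is_real_mutation(reference, "ACCTA", "ACGTA", 2)
```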
  • For the UMI molecular tag, for example, 8 random bases can be added at the P5 end of the adapter in place of the Index.
  • An adapter synthesized by this method is simple, economical and widely applicable, and has been widely used. However, during library construction the UMI adapters are ligated randomly, so one original DNA fragment may be ligated to two different UMI adapters, giving different UMI tags on the forward and reverse strands. As a result, the original forward and reverse strand information cannot be tracked and the forward and reverse strand sequences cannot be accurately corrected. In addition, if a base mutation occurs in the UMI sequence, the apparent number of original DNA fragments increases, introducing potential false positive mutations.
  • Another approach is the introduction of double-ended UMI molecular tags. In related technologies, a single-stranded adapter sequence is first synthesized; it includes a first sequence and a second sequence, where the second sequence includes the protective bases of a restriction endonuclease and a double-stranded molecular tag of random bases. The single-stranded adapter sequence is then annealed to form a double-stranded adapter, and finally an adapter with a 3'-dT tail is obtained by enzyme digestion. Although such a double-stranded adapter effectively solves the problem that a single-ended UMI cannot track the original forward and reverse strands, false positive mutations are still introduced when the UMI sequence itself is mutated.
  • The adapter 10 includes: a first strand 11 and a second strand 12, and at least one UMI molecular tag 20, each UMI molecular tag 20 being located on the first strand 11 or on the second strand 12.
  • The adapter 10 can be classified as a long adapter (complete Y-shaped adapter) or a short adapter (incomplete Y-shaped adapter) according to whether it is compatible with a PCR-free library.
  • the long adapter is connected to both ends of the DNA fragment to be tested (that is, the original DNA fragment as described above) by TA ligation.
  • If the library yield is sufficient, it can be sequenced on the machine directly without PCR amplification;
  • After the short adapter is ligated to both ends of the DNA fragment to be tested by TA ligation, it must be PCR-amplified using Indexing Primers complementary to the short adapter to become a complete adapter before it can be sequenced on the machine.
  • the Index sequence is configured to mark different samples of the sequence to be tested.
  • A sample can include thousands of DNA molecules, and UMI molecular tags 20 are used to mark different DNA molecules in the same sample or in different samples.
  • The adapter 10 can be classified as a single-ended Index adapter or a double-ended Index adapter.
  • The single-ended Index adapter has an Index sequence only at the P7 end, and the double-ended Index adapter has an Index sequence at both the P5 end and the P7 end.
  • the UMI molecular tag 20 can be added to the P7 end instead of the Index sequence.
  • The first strand 11 may include, sequentially from the 5' end, the first PCR amplification primer 111 (also known as P5) and the first sequencing primer sequence 112 (R1SP).
  • The second strand 12 may include, sequentially from the 5' end, the second sequencing primer sequence 121 (R2SP), the UMI molecular tag 20 and the second PCR amplification primer 122 (also known as P7). That is, the adapter 10 is a single-ended UMI adapter.
  • the at least one UMI molecular tag 20 includes at least one random base and at least one fixed base.
  • the number and arrangement of random bases and fixed bases in one UMI molecular tag 20 are not specifically limited.
  • The random base and the fixed base can be arranged in the same direction. For example, the random base and the fixed base are arranged sequentially in the direction from the 5' end to the 3' end of the UMI sequence, or sequentially in the direction from the 3' end to the 5' end of the UMI sequence.
  • Taking the case in which at least one of the random bases and the fixed bases is plural as an example, there are two possible situations.
  • In the first situation, the plural bases are arranged contiguously. When there are multiple random bases and one fixed base, the fixed base can be located on one side of the multiple random bases. For example, if the direction from the 5' end to the 3' end of the UMI sequence is called the first direction and the direction from the 3' end to the 5' end of the UMI sequence is called the second direction, the fixed base can be located on the first-direction side or the second-direction side of the multiple random bases.
  • When there are multiple fixed bases and one random base, the random base can be located on one side of the multiple fixed bases (with the first direction and the second direction defined as above, the random base can be located on the first-direction side or the second-direction side of the multiple fixed bases).
  • When multiple random bases and multiple fixed bases are both arranged contiguously, the multiple fixed bases can be located on one side of the multiple random bases (with the first direction and the second direction defined as above, the multiple fixed bases can be located on the first-direction side or the second-direction side of the multiple random bases).
  • In the second situation, there are multiple random bases and/or multiple fixed bases, and at least two of the multiple random bases are spaced apart, and/or at least two of the multiple fixed bases are spaced apart.
  • In a first case, there are multiple random bases and one fixed base. In this case, the fixed base is located between any two adjacent random bases among the multiple random bases.
  • In a second case, there are multiple fixed bases and one random base. In this case, the random base is located between any two adjacent fixed bases among the multiple fixed bases.
  • When both the random bases and the fixed bases are plural, in a first arrangement at least two of the multiple random bases are spaced apart, and the multiple fixed bases are located between any two spaced random bases.
  • In a second arrangement, at least two of the multiple fixed bases are spaced apart, and the multiple random bases are located between any two spaced fixed bases.
  • In a third arrangement, at least two of the multiple random bases are spaced apart and at least two of the multiple fixed bases are spaced apart. That is, the multiple random bases and the multiple fixed bases are interleaved: two spaced random bases can be separated by one or more fixed bases, and two spaced fixed bases can be separated by one or more random bases.
  • the random base means that the base is random, and can be selected from any one of the four bases (A, T, C, and G), and can be represented by N. Random bases are selected from different bases and can be used to label different DNA molecules.
  • the N in the UMI molecular tag 20 can be selected from any of the four bases.
  • When the N differs between UMI molecular tags 20, 4 kinds of UMI molecular tags 20 can be obtained. These 4 kinds of UMI molecular tags 20 can be made into 4^2 (that is, 16) adapters (one DNA molecule is ligated to two adapters), so that 4^2 (that is, 16) different DNA molecules can be labeled, and the detection of 4^2 (that is, 16) different DNA molecules can be completed.
  • Each N in the UMI molecular tag 20 can be selected from any of the 4 bases.
  • With 3 Ns in the UMI molecular tag 20 there are 4^3 (that is, 64) combinations, so 4^3 (that is, 64) kinds of UMI molecular tags 20 can be obtained. These 64 kinds of UMI molecular tags 20 can be made into 64^2 (that is, 4096) adapters (one DNA molecule is ligated to two adapters), so that 64^2 (that is, 4096) different DNA molecules can be labeled, and the detection of 64^2 (that is, 4096) different DNA molecules can be completed.
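The counting above can be checked with a short calculation. This is a minimal sketch; the function name `umi_diversity` and its parameters are illustrative and not part of the disclosure.

```python
def umi_diversity(n_random, adapters_per_molecule=2):
    """Return (number of distinct tags, number of distinguishable molecules).

    Each random base N can be any of 4 bases, and each DNA molecule is
    ligated to two adapters, so the tag count is squared.
    """
    tags = 4 ** n_random
    return tags, tags ** adapters_per_molecule


one_random = umi_diversity(1)    # 4 tags, 4^2 molecules
three_random = umi_diversity(3)  # 4^3 tags, 64^2 molecules
```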
  • the fixed bases are selected from fixed known bases, and are used to correct the side sequences and UMI molecular tags themselves when errors occur in amplification or sequencing, so as to reduce the introduction of false positive mutations.
  • For example, suppose there are 100 original DNA fragments with the same starting position and ending position (that is, the same sequence), recorded as original sequence 1, original sequence 2, original sequence 3, ..., original sequence 99 and original sequence 100, where original sequence 2 is a mutated sequence and the real mutation frequency is 1%.
  • After UMI adapters are ligated, the sequences are still recorded as original sequence 1, original sequence 2, original sequence 3, ..., original sequence 99 and original sequence 100. The 100 original sequences ligated with UMI adapters are enriched by PCR amplification to obtain a DNA library. As shown in Figure 2, the DNA library includes 100 copies of original sequence 1 ligated with the UMI adapter (to distinguish them, the 99 copies produced by amplification are recorded as original sequence 1').
  • Taking original sequence 1 as an example, the AAGCT on the UMI adapter shows that the original sequences 1' ligated with this UMI adapter were copied by PCR amplification. Because the detection site of original sequence 1 is an A base, the copied original sequences 1' should also have an A base at that site; if the 100th original sequence 1' has a C base there, it can be judged to be a noise mutation caused by a PCR amplification error or a sequencing error.
  • If the UMI molecular tag 20 were composed only of random bases, a copy whose UMI had also changed would be judged to carry real mutations in both the DNA sequence and the UMI adapter, introducing a false positive mutation. In the embodiment of the present disclosure, as in the second case in Figure 2, the middle base of the 5-base UMI molecular tag 20 is fixed as a G base. It can therefore be determined that the UMI molecular tag 20 is AAGCT, not AATCT, and that the change in the UMI molecular tag 20 of the 100th original sequence 1' is a noise mutation introduced by PCR amplification or sequencing. Moreover, since the DNA sequences of the remaining 99 original sequences 1' marked by this UMI molecular tag 20 have no mutation, it can be determined that the change in the DNA sequence of the 100th original sequence 1' is also a noise mutation introduced by PCR amplification or sequencing.
  • By using UMI molecular tags 20 with partially fixed bases, the diversity of adapters can be guaranteed, different original DNA fragments can be marked, and the noise mutations introduced by PCR amplification or sequencing can be eliminated to a certain extent, so that the detection accuracy can be improved.
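The AAGCT versus AATCT reasoning can be sketched in code. The pattern notation (`N` for a random position, a concrete letter for a fixed position) and the function name `check_umi` are assumptions made for this illustration; the disclosure does not prescribe an implementation.

```python
def check_umi(observed, pattern):
    """Compare an observed UMI against a pattern such as "NNGNN".

    Returns the positions where a fixed base was violated, and the UMI
    with the fixed positions restored to their known bases.
    """
    if len(observed) != len(pattern):
        raise ValueError("observed UMI and pattern must have the same length")
    mismatches = [
        i for i, (o, p) in enumerate(zip(observed, pattern))
        if p != "N" and o != p
    ]
    corrected = "".join(o if p == "N" else p for o, p in zip(observed, pattern))
    return mismatches, corrected


# The middle base is known to be G, so AATCT must contain an error there.
bad_positions, restored = check_umi("AATCT", "NNGNN")
clean_positions, unchanged = check_umi("AAGCT", "NNGNN")
```

Because the fixed base is known in advance, a mismatch at that position can be attributed to amplification or sequencing noise rather than to a distinct molecule.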
  • There are at least three random bases, the at least three random bases are spaced apart from one another, and there is one group of fixed bases between every two adjacent spaced random bases.
  • Every two groups of fixed bases contain the same number of fixed bases.
  • With at least three random bases, the UMI molecular tag 20 can mark at least 4096 different DNA molecules, increasing the number of molecules that can be detected and thereby improving the detection accuracy of the sample. At the same time, by placing one group of fixed bases between every two adjacent spaced random bases, with the same number of fixed bases in every two groups, the regularity of the arrangement of random bases and fixed bases can be improved, so that it is easier to identify whether a change is a mutation of a fixed base or a mutation of the original DNA fragment itself, reducing the introduction of false positive mutations and improving detection accuracy. In addition, testing found that, for the same number of bases in the UMI molecular tag 20, arranging the multiple random bases at intervals gives higher detection accuracy than arranging them contiguously.
  • The fixed bases serve to exclude noise mutations introduced by PCR amplification or sequencing; the better the error tolerance during detection, the better the detection accuracy.
  • Take a UMI molecular tag 20 including 3 random bases and 4 fixed bases as an example.
  • The UMI molecular tag 20 has 4 fixed bases available for fault tolerance, so the fault tolerance rate can be calculated as 4 divided by 7, times 100%, which is about 57%.
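The fault tolerance rate above is simply the share of fixed bases in the tag. A minimal sketch, with the function name `fault_tolerance_rate` chosen for illustration only:

```python
def fault_tolerance_rate(n_fixed, n_random):
    """Percentage of tag positions that are fixed (known) bases."""
    return n_fixed / (n_fixed + n_random) * 100


rate = fault_tolerance_rate(n_fixed=4, n_random=3)
print(f"{rate:.0f}%")  # 4 / 7 * 100, prints "57%"
```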
  • 2 to 4 fixed bases may be the same or different, which is not specifically limited here.
  • the 2-4 fixed bases are all different.
  • At least one fixed base in one group of fixed bases is different from one fixed base in the other group of fixed bases.
  • Take as an example a UMI molecular tag 20 with 3 random bases that are spaced apart from one another, with one fixed base between every two adjacent random bases (that is, each of the two adjacent groups of fixed bases includes 1 fixed base). Because, in every two adjacent groups of fixed bases, at least one fixed base in one group is different from one fixed base in the other group, the sequence of the UMI molecular tag can be expressed as follows:
  • N1 and N2 are different and are respectively selected from any one of A, T, C and G, and the three Ns may be the same or different, and are independently selected from any one of A, T, C and G.
  • the two adjacent fixed bases are A and C respectively, and these two fixed bases are different.
  • Compared with N1 and N2 being selected from the same base, this prevents the same fluorescence (marking the same base) from being concentrated during sequencing (identical base types tend to concentrate the same fluorescence), thereby avoiding inaccurate reading caused by concentrated fluorescence and improving detection accuracy.
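As an illustration of the 5-base case, the sketch below enumerates every tag for a given pair of fixed bases. It assumes the layout is random, fixed (N1), random, fixed (N2), random, which is inferred from the description because the sequence itself is not reproduced in this text; the name `five_base_tags` is hypothetical.

```python
from itertools import product

BASES = "ATCG"


def five_base_tags(n1, n2):
    """List all tags of the assumed form N-N1-N-N2-N for fixed bases n1 != n2."""
    if n1 == n2:
        raise ValueError("the two fixed bases are required to differ")
    return ["".join((a, n1, b, n2, c)) for a, b, c in product(BASES, repeat=3)]


# Fixed bases A and C, as in the example in the text.
tags = five_base_tags("A", "C")
```

With A and C as the fixed bases, the three random positions yield 64 distinct tags, all sharing the same two known bases.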
  • Take as an example a UMI molecular tag 20 with 3 random bases that are spaced apart from one another, with 2 fixed bases between every two adjacent random bases (that is, each of the two adjacent groups of fixed bases includes 2 fixed bases). Because, in every two adjacent groups of fixed bases, at least one fixed base in one group is different from one fixed base in the other group, the sequence of the UMI molecular tag can be expressed as follows:
  • N3 and N4 can be the same or different, and are independently selected from any one of A, T, C and G; N5 and N6 can be the same or different, and are independently selected from any one of A, T, C and G; at least one of N3 and N4 is different from any one of N5 and N6; and the three Ns are the same or different, and are independently selected from any one of A, T, C and G.
  • For N3 and N4, there can be two possible situations.
  • In the first situation, N3 and N4 are the same. In this situation, there can be two possible cases for N5 and N6.
  • In the first case, N5 and N6 are different. Then either one fixed base in N5 and N6 is different from the two fixed bases in N3 and N4, or both fixed bases in N5 and N6 are different from the two fixed bases in N3 and N4. Taking N3 and N4 both selected from A as an example, where one fixed base in N5 and N6 is different from N3 and N4, N5 and N6 can be selected from A and C, A and G, or A and T respectively. When N5 and N6 are selected from A and C respectively, the two adjacent groups of fixed bases are AA and AC, and one fixed base (C) in one group is different from the two fixed bases (A) in the other group. When N5 and N6 are selected from A and G respectively, the two adjacent groups of fixed bases are AA and AG, and one fixed base (G) in one group is different from the two fixed bases (A) in the other group. When N5 and N6 are selected from A and T respectively, the two adjacent groups of fixed bases are AA and AT, and one fixed base (T) in one group is different from the two fixed bases (A) in the other group.
  • Where both fixed bases in N5 and N6 are different from the two fixed bases in N3 and N4, N5 and N6 can be selected from C and G, C and T, or G and T respectively. When N5 and N6 are selected from C and G respectively, the two adjacent groups of fixed bases are AA and CG, and the two fixed bases (C and G) in one group are both different from the two fixed bases (A) in the other group. When N5 and N6 are selected from C and T respectively, the two adjacent groups of fixed bases are AA and CT, and the two fixed bases (C and T) in one group are both different from the two fixed bases (A) in the other group. When N5 and N6 are selected from G and T respectively, the two adjacent groups of fixed bases are AA and GT, and the two fixed bases (G and T) in one group are both different from the two fixed bases (A) in the other group.
  • In the second case, N5 and N6 are the same. Here, the condition that at least one fixed base in N5 and N6 is different from one fixed base in N3 and N4 means that the two fixed bases in N5 and N6 are both different from the two fixed bases in N3 and N4. Still taking N3 and N4 both selected from A as an example, N5 and N6 can both be selected from T, G or C. When N5 and N6 are both selected from T, the two fixed bases (T) in N5 and N6 are different from the two fixed bases (A) in N3 and N4; when both are selected from G, the two fixed bases (G) are different from the two fixed bases (A) in N3 and N4; when both are selected from C, the two fixed bases (C) in N5 and N6 are different from the two fixed bases (A) in N3 and N4.
  • In the second situation, N3 and N4 are different. In this situation, there can again be two possible cases for N5 and N6.
  • In the first case, N5 and N6 are different. Here, the condition that at least one fixed base in N5 and N6 is different from one fixed base in N3 and N4 means either that one fixed base in N5 and N6 is different from one fixed base in N3 and N4, or that the two fixed bases in N5 and N6 are different from the two fixed bases in N3 and N4. Taking N3 and N4 selected from A and T respectively as an example, where one fixed base in N5 and N6 is different from one fixed base in N3 and N4, N5 and N6 can be selected from A and C, A and G, T and C, or T and G, etc. When N5 and N6 are selected from A and C respectively, one fixed base (C) in N5 and N6 is different from one fixed base (T) in N3 and N4. When N5 and N6 are selected from A and G respectively, one fixed base (G) in N5 and N6 is different from one fixed base (T) in N3 and N4. When N5 and N6 are selected from T and C respectively, one fixed base (C) in N5 and N6 is different from one fixed base (T) in N3 and N4. When N5 and N6 are selected from T and G respectively, one fixed base (G) in N5 and N6 is different from one fixed base (T) in N3 and N4.
  • Where the two fixed bases in N5 and N6 are different from the two fixed bases in N3 and N4, N5 and N6 can be selected from G and C. When N5 and N6 are selected from G and C respectively, the two fixed bases (G and C) in N5 and N6 are different from the two fixed bases (A and T) in N3 and N4.
  • In the second case, N5 and N6 are the same. Here, the condition that at least one fixed base in N5 and N6 is different from one fixed base in N3 and N4 means that the two fixed bases in N5 and N6 are different from one or two fixed bases in N3 and N4. Still taking N3 and N4 selected from A and T as an example, N5 and N6 can both be selected from A, T, C or G. When N5 and N6 are both selected from A, the two fixed bases in N5 and N6 are different from one fixed base in N3 and N4; when N5 and N6 are both selected from T, the two fixed bases in N5 and N6 are different from one fixed base in N3 and N4; when N5 and N6 are both selected from C, the two fixed bases in N5 and N6 are different from the two fixed bases in N3 and N4; and when N5 and N6 are both selected from G, the two fixed bases in N5 and N6 are also different from the two fixed bases in N3 and N4.
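The case analysis above can be condensed into a single check. The sketch below is one possible reading, not the disclosure's own definition: two fixed groups are treated as acceptable when the sets of bases they contain are not identical, which agrees with every example pair listed above (AA/AC, AA/CG, AA/TT, AT/AC, AT/GC, AT/AA and so on). The function name `groups_differ` is hypothetical.

```python
from itertools import product

BASES = "ATCG"


def groups_differ(group_a, group_b):
    """True if some base in one fixed group does not occur in the other group."""
    return set(group_a) != set(group_b)


# All ordered pairs of 2-base groups (N3N4, N5N6) that satisfy the check.
two_mers = ["".join(p) for p in product(BASES, repeat=2)]
valid_pairs = [(a, b) for a, b in product(two_mers, repeat=2) if groups_differ(a, b)]
```

Under this reading, a pair such as AA and AA is rejected, while every pair named in the text is accepted.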
  • the random bases are limited to 3, and 4096 different DNA molecules can be labeled, so that the application requirements can be met.
  • the UMI molecular tag 20 includes 7-11 bases.
  • By limiting the number of bases contained in the UMI molecular tag 20 to 7 to 11, two problems can be avoided: a UMI molecular tag 20 that is too long takes up sequencing data downstream, and a UMI molecular tag 20 that is too short is not conducive to improving fault tolerance and/or to labeling a large number of DNA molecules (for example, because there are too few random bases).
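To see how the 7 to 11 base range relates to the group layout, the sketch below assumes exactly 3 random bases separated by two fixed groups of equal size, as in the embodiments above; the helper `tag_pattern` and the placeholder letter `F` for a fixed base are illustrative only.

```python
def tag_pattern(group_size, n_random=3):
    """Build a layout string such as "NFFNFFN" (N = random, F = fixed)."""
    return ("F" * group_size).join("N" * n_random)


# Fixed-group sizes whose total tag length falls in the 7 to 11 base range.
allowed = {
    k: tag_pattern(k)
    for k in range(1, 6)
    if 7 <= len(tag_pattern(k)) <= 11
}
```

Under these assumptions, only fixed groups of 2, 3 or 4 bases give a tag length in the stated range (7, 9 and 11 bases respectively), which is consistent with the 2 to 4 fixed bases per group mentioned above.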
  • the two UMI molecular tags 20 are located on the first strand 11 and the second strand 12 respectively, and are bound by at least part of complementary base pairing.
  • the two UMI molecular tags 20 can be respectively the first UMI molecular tag and the second UMI molecular tag.
  • The first UMI molecular tag 20 can be located between the first sequencing primer sequence 112 and the first PCR amplification primer 111, the second UMI molecular tag 20 can be located between the second sequencing primer sequence 121 and the second PCR amplification primer 122, and the two tags can be bound by partial complementary base pairing.
  • In this case, the adapter 10 is the same as the single-ended UMI adapter, and the forward and reverse strands cannot be tracked.
  • The first strand 11 is a forward strand (as shown in Figure 3, the strand arranged from the 5' end to the 3' end from left to right), and the second strand 12 is a reverse strand (as shown in Figure 3, the strand arranged from the 3' end to the 5' end from left to right).
  • The UMI molecular tag 20 on the first strand 11 (that is, the first UMI molecular tag described above) is located downstream of the first sequencing primer sequence 112.
  • The UMI molecular tag 20 on the second strand 12 (that is, the second UMI molecular tag described above) is located upstream of the second sequencing primer sequence 121.
  • the adapter 10 can also be called a double-end UMI adapter.
  • the forward strand and the reverse strand are tracked at the same time, so that a mutated base appearing at the same position in both strands can be marked as a real mutation, which further improves detection accuracy.
  • the adapter 10 further includes Index sequence 1 and Index sequence 2; Index sequence 1 is located on the second strand 12, Index sequence 2 is located on the first strand 11, and Index sequence 1 and Index sequence 2 can label different samples.
  • an adapter ligation reagent including: various adapters 10, ligase, buffer, etc.
  • the various adapters 10 are the above-mentioned adapters 10, and the ligase can be exemplified by DNA ligase or RNA Ligase, whose role is to promote the ligation of various adapters 10 and DNA fragments after end repair, and the buffer provides a stable pH environment for the adapter ligation reaction.
  • among the various adapters 10, every two kinds of adapters 10 differ in at least one random base of at least one UMI molecular tag 20 they contain.
  • the various adapters 10 mentioned above are all UMI adapters, and at least one UMI molecular tag 20 contained in the UMI adapter includes at least one random base and at least one fixed base. The random bases are selected from different bases, so different DNA molecules can be labeled with different UMI adapters and multiple different DNA molecules can be sequenced. The fixed bases are known, so errors arising in amplification or sequencing can be corrected, which reduces the introduction of false positive mutations.
  • kits comprising an adapter ligation reagent as described above.
  • the kit may be an adapter ligation kit.
  • a kit is a box holding chemical reagents used to detect chemical components, drug residues, virus types and the like; here it is a box holding the adapter ligation reagent.
  • UMI molecular tag 20 includes at least one random base and at least one fixed base.
  • the gene may include a DNA molecule or an RNA molecule for expression of genetic information, and the UMI molecular tag 20 is configured to mark different DNA molecules or RNA molecules.
  • the gene may include cfDNA, and the UMI molecular tag 20 may be used in a UMI linker to mark different cfDNA molecules.
  • Some embodiments of the present disclosure provide a DNA or RNA library construction method, comprising:
  • the fragmented DNA can be obtained by mechanical fragmentation or enzymatic hydrolysis.
  • cDNA (complementary DNA)
  • cDNA can be obtained by reverse transcription of mRNA, and fragmented DNA can be obtained after the cDNA is sheared.
  • some DNA is free DNA in blood, which is already fragmented and can be obtained directly from blood or through commercial channels, such as cfDNA (Circulating Free DNA), a DNA that exists outside cells in a free, cell-free state.
  • the KAPA Biosystem (also referred to as KAPA) kit can be used to perform end repair and A-tailing on the cfDNA.
  • the end repair product is treated with the adapter ligation reagent as described above, and the adapter in the adapter ligation reagent reacts with the end repair product to obtain the adapter ligation product.
  • each end repair product can include a forward strand and a reverse strand, and one end repair product can be ligated to two adapters 10;
  • when each adapter 10 includes one UMI molecular tag 20,
  • the adapter 10 is a single-end UMI adapter, which can label different end repair products but cannot track their forward and reverse strands;
  • when the adapter 10 is a double-end UMI adapter, the forward and reverse strands of the end repair product can be tracked, so that a mutated base appearing at the same position in both strands can be marked as a true mutation, which further improves detection accuracy.
  • Adapter ligation products are enriched to generate DNA or RNA libraries.
  • adapter ligation products can be enriched by PCR amplification.
  • the UMI molecular tag 20 in the adapter 10 includes at least one random base and at least one fixed base
  • the random base is selected from different bases
  • the UMI molecular tag 20 can be used to mark different DNA molecules according to the differences in the random bases
  • the fixed base is selected from known fixed bases, which can be corrected when errors occur in the sequence to be tested and UMI molecular tag 20 itself during amplification or sequencing, thereby reducing the introduction of false positive mutations and improving detection accuracy.
  • Some embodiments of the present disclosure provide a gene sequencing detection method, comprising:
  • the DNA or RNA is sequenced using the DNA or RNA library obtained by the DNA or RNA library construction method described above.
  • DNA or RNA is sequenced using the DNA or RNA library obtained by the DNA or RNA library construction method described above. Because every DNA or RNA molecule in that library is ligated to an adapter 10, and the adapter 10 contains a UMI molecular tag 20, the DNA or RNA molecules can be marked by the UMI molecular tag 20, and the fixed bases can be used in subsequent sequencing to correct errors generated during sequencing or amplification, thereby reducing the introduction of false positive mutations and improving detection accuracy.
  • kits comprising: the DNA or RNA library obtained by the DNA or RNA library construction method described above.
  • the kit can also include a targeted capture kit, which can include a targeted capture reagent; the reagent can perform targeted capture by hybridization or by multiplex PCR (which can take place before enrichment during library construction), and either route allows sequencing of selected genes.
  • the molecular tag 20 includes at least one random base N and at least one fixed base.
  • the two UMI molecular tags 20 may be located on the first strand 11 and the second strand 12 of the linker 10 , for details, refer to the description of the linker 10 including the two UMI molecular tags 20 , which will not be repeated here.
  • Some embodiments of the present disclosure provide a method for preparing an adapter 10, the adapter 10 including at least one UMI molecular tag 20, as shown in FIG. 5, the preparation method includes:
  • each UMI molecular tag 20 is located on the first strand 11 or the second strand 12, and at least one UMI molecular tag 20 includes at least one random base and at least one fixed base.
  • the first strand 11 and the second strand 12 can each be synthesized by chemical synthesis (that is, DNA synthesis) rather than by biological synthesis.
  • once the UMI molecular tag set has been obtained, one strand (such as the first strand 11) and the part of the other strand (such as the second strand 12) that does not pair with the first strand 11 can also be synthesized on the basis of the UMI molecular tag set.
  • the first strand 11 and the second strand 12 can be combined by partial complementary base pairing by specific annealing.
  • Step 1) Synthesize the first strand 11 (its UMI molecular tag 20 is located downstream of the first sequencing primer sequence 112 and includes 3 random bases N, with 2 fixed bases between every two random bases N, and its end carries a thio-modified T base) and the second strand 12 (its UMI molecular tag 20 is located upstream of the second sequencing primer sequence 121 and includes 3 random bases N, with 2 fixed bases between every two random bases N, and its end is linked to a phosphate group), 64 of each.
  • the sequence of the first strand 11 is shown in SEQ ID NO: 1 in the sequence listing, and the sequence of the second strand 12 is shown in SEQ ID NO: 2 in the sequence listing.
  • first chain 11 and the second chain 12 may also be shown in Table 1 below:
  • first chain 11 5'-aatgatacggcgaccaccgagatgtnnnnnnnacactctttccctacacgacgctcttccgatcnagcntagn-s-t-3' second chain 12 3'-g-s-ttcgtcttctgccgtatgctctannnnnncactgacctcaagtctgcacacgagaaggctagntcngan-p'-5'
  • when N in the first strand 11 is selected from the 4 different bases,
  • there are 64 UMI molecular tag 20 sequences in each of the first strand 11 and the second strand 12, and the 64 sequences are shown in Table 2 below:
  • Step 2) Resuspend each of the paired first strand 11 and second strand 12 to 100uM in 100uL of buffer reagent; the buffer reagent contains 10mM Tris (bringing the pH of the buffer reagent to 7.5), 2mM EDTA and 50mM NaCl.
  • Step 3) Take 10uL of the first strand 11, 10uL of the second strand 12, and 80uL of the buffer reagent in a PCR tube, mix well, and centrifuge briefly.
  • Step 4) Place the PCR tube in the PCR machine, set the program temperature to 95°C, and the reaction time to 10 minutes. After the reaction, turn off the PCR machine, and wait until the temperature drops to room temperature (about 2 hours, the room temperature is about 25 degrees), Remove the PCR tube.
  • Step 5) Take a 1uL sample for quality inspection on an automatic nucleic acid and protein analyzer (Qsep100). The results are shown in Figure 6.
  • the peaks at 70bp to 80bp are the double-stranded adapter
  • LM is the Low Marker
  • with a length of 20bp
  • UM is the Upper Marker, with a length of 1000bp.
  • LM and UM serve as references for locating the double-stranded adapter, and the synthesis efficiency of the adapter can reach about 40%.
  • the steps in Example 2 are basically the same as those in Example 1 and are not repeated here. The difference is that in step 1), some of the fixed bases in the UMI molecular tags of the first strand 11 and the second strand 12 are different.
  • in Example 2, the sequence of the first strand 11 is shown in SEQ ID NO: 131 in the sequence listing, and the sequence of the second strand 12 is shown in SEQ ID NO: 132 in the sequence listing.
  • first chain 11 and the second chain 12 can also be shown in the following table 3:
  • first chain 11 5'-aatgatacggcgaccaccgagatctnnnnnnnacactctttccctacacgacgctcttccgatcnagcntagn-s-t-3' second chain 12 3'-g-s-ttcgtcttctgccgtatgctctannnnnncactgacctcaagtctgcacacgagaaggctagntcgnatcn-p'-5'
  • when N in the first strand 11 is selected from the 4 different bases,
  • the 64 UMI molecular tag 20 sequences are shown in Table 4 below:
  • the steps in Example 3 are basically the same as those in Example 1 and are not repeated here. The difference is that in step 1), some of the fixed bases in the UMI molecular tags of the first strand 11 and the second strand 12 are different.
  • in Example 3, the sequence of the first strand 11 is shown in SEQ ID NO: 261 in the sequence listing, and the sequence of the second strand 12 is shown in SEQ ID NO: 262 in the sequence listing.
  • first chain 11 and the second chain 12 may also be shown in Table 5 below:
  • first chain 11 5'-aatgatacggcgaccaccgagatctnnnnnnnacactctttccctacacgacgctcttccgatcnagctnagctn-s-t-3' second chain 12 3'-g-s-ttcgtcttctgccgtatgctctannnnnncactgacctcaagtctgcacacgagaaggctagntcgantcgan-p'-5'
  • when N in the first strand 11 is selected from the 4 different bases,
  • there are 64 UMI molecular tag 20 sequences in each of the first strand 11 and the second strand 12, and the 64 sequences are shown in Table 6 below:
  • Step 1) Customize the cfDNA standard product of Jingliang Gene Company with multiple mutation sites as the sample.
  • the mutation frequency is 1% and 0.1%.
  • Step 2) Use the KAPA kit to perform end repair and A-tailing on the cfDNA.
  • Step 3) Use the KAPA kit and the adapter synthesized in Example 1 to ligate adapters to the cfDNA, obtaining the adapter ligation product.
  • Step 4) Amplify, enrich and purify the adapter ligation product to obtain a cfDNA library.
  • Step 5) Use a complete set of kits from IDT (Integrated DNA Technologies) to perform targeted capture on the adapter ligation products, obtaining the adapter ligation products of the selected genes.
  • Step 6) Using the cfDNA library obtained in step 4) as the sample, sequence on a Novaseq 6000 (Illumina) instrument following the instrument's routine operating procedure.
  • Step 7) Use FastQC software to analyze the basic quality control of the off-machine data.
  • the actual detected sites and mutations are basically consistent with the theoretical values.
  • the specific detection results are shown in Table 7 and Table 8 below.
  • in Table 7, the mutation frequencies actually detected in Experimental Example 1 at the different mutation sites of the selected genes are basically between 0.089% and 0.12%, which is fairly accurate relative to the theoretical mutation frequency (0.1%).
  • in Experimental Example 2, the mutation frequencies actually detected at the different mutation sites of the selected genes are basically between 0.081% and 0.150%, which is also accurate relative to the theoretical mutation frequency.
  • in Experimental Example 3, the mutation frequencies actually detected at the different mutation sites of the selected genes are basically between 0.079% and 0.140%, which is also fairly accurate relative to the theoretical mutation frequency.
  • in Table 8, the mutation frequencies actually detected in Experimental Example 1 at the different mutation sites of the selected genes are basically between 0.80% and 1.20%, which is fairly accurate relative to the theoretical mutation frequency (1%).
  • in Experimental Example 2, the mutation frequencies actually detected at the different mutation sites of the selected genes are basically between 0.85% and 1.30%, which is also accurate relative to the theoretical mutation frequency.
  • in Experimental Example 3, the mutation frequencies actually detected at the different mutation sites of the selected genes are basically between 0.78% and 1.25%, which is also fairly accurate relative to the theoretical mutation frequency.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Organic Chemistry (AREA)
  • Engineering & Computer Science (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biochemistry (AREA)
  • Microbiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Plant Pathology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • General Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Medicinal Chemistry (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A UMI molecular tag, comprising at least one random base and at least one fixed base.

Description

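UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method
Technical Field
The present disclosure relates to the field of biotechnology, and in particular to a UMI molecular tag and use thereof, an adapter, an adapter ligation reagent, a kit, and a library construction method.
Background
Next-generation sequencing (NGS, also called second-generation sequencing) is currently the most widely used sequencing technology, offering high sequencing depth, high throughput, high accuracy, and good sensitivity.
Summary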
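In one aspect, a UMI molecular tag is provided, comprising at least one random base and at least one fixed base.
In some embodiments, at least one of the random base and the fixed base is plural; the plural random bases and/or the plural fixed bases are arranged contiguously; or, at least two of the plural random bases are spaced apart, and/or at least two of the plural fixed bases are spaced apart.
In some embodiments, there are multiple random bases, at least two of which are spaced apart, and every two spaced random bases are separated by 1 to 5 fixed bases.
In some embodiments, there are at least three random bases; the at least three random bases are all spaced apart from one another, every two spaced random bases are separated by one group of fixed bases, and every two groups of fixed bases contain the same number of fixed bases.
In some embodiments, in every two adjacent groups of fixed bases, at least one fixed base in one group differs from a fixed base in the other group.
In some embodiments, every two spaced random bases are separated by 2 to 4 fixed bases, and those 2 to 4 fixed bases all differ from one another.
In some embodiments, there are 3 random bases.
In some embodiments, the UMI molecular tag comprises 7 to 11 bases.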
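In another aspect, a molecular tag set is provided, comprising two UMI molecular tags bound to each other by at least partial complementary base pairing, wherein at least one of the UMI molecular tags is a UMI molecular tag as described above.
In another aspect, an adapter is provided, comprising: a first strand and a second strand; and at least one UMI molecular tag, each UMI molecular tag being located on the first strand or the second strand, the at least one UMI molecular tag being a UMI molecular tag as described above.
In some embodiments, there are two UMI molecular tags, located on the first strand and the second strand respectively and bound by at least partial complementary base pairing.
In some embodiments, the first strand is a forward strand and the second strand is a reverse strand; the first strand comprises a first sequencing primer sequence and the second strand comprises a second sequencing primer sequence; the UMI molecular tag on the first strand is located downstream of the first sequencing primer sequence, and the UMI molecular tag on the second strand is located upstream of the second sequencing primer sequence.
In another aspect, an adapter ligation reagent is provided, comprising multiple kinds of adapters, each being an adapter as described above; among the multiple kinds of adapters, every two kinds differ in at least one random base of at least one UMI molecular tag they contain.
In another aspect, a kit is provided, comprising the adapter ligation reagent described above.
In another aspect, use of the UMI molecular tag described above in gene sequencing is provided.
In some embodiments, the gene comprises a DNA molecule for the expression of genetic information; the UMI molecular tag is configured to label different DNA molecules.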
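In another aspect, a DNA library construction method is provided, comprising:
obtaining fragmented DNA; performing end repair and A-tailing on the fragmented DNA to obtain an end repair product; treating the end repair product with the adapter ligation reagent described above, so that the adapters in the adapter ligation reagent react with the end repair product to give an adapter ligation product; and enriching the adapter ligation product to obtain a DNA library.
In another aspect, a gene sequencing detection method is provided, comprising: sequencing DNA using the DNA library obtained by the DNA library construction method described above.
In yet another aspect, a kit is provided, comprising the DNA library obtained by the DNA library construction method described above.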
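Brief Description of the Drawings
To explain the technical solutions of the present disclosure more clearly, the drawings needed in some embodiments are briefly introduced below. The drawings described below are evidently only drawings of some embodiments of the present disclosure, and a person of ordinary skill in the art can derive other drawings from them. In addition, the drawings described below may be regarded as schematic diagrams and do not limit the actual size of products, the actual flow of methods, the actual timing of signals, and so on, involved in the embodiments of the present disclosure.
FIG. 1 is a structural diagram of a Y-shaped adapter according to some embodiments;
FIG. 2 is a flowchart of a sequencing method according to some embodiments;
FIG. 3 is a structural diagram of another Y-shaped adapter according to some embodiments;
FIG. 4 is a structural diagram of a UMI molecular tag set according to some embodiments;
FIG. 5 is a flowchart of a method for preparing an adapter according to some embodiments;
FIG. 6 is a capillary electrophoresis peak plot used to assess the synthesis efficiency of the double-stranded adapters of Example 1, Example 2, and Example 3, according to some embodiments.
Detailed Description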
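The technical solutions in some embodiments of the present disclosure are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments provided herein fall within the scope of protection of the present disclosure.
Unless the context requires otherwise, throughout the specification and claims the term "comprise" and its other forms, such as the third-person singular "comprises" and the present participle "comprising", are to be interpreted in an open, inclusive sense, that is, "including, but not limited to". In the description, the terms "one embodiment", "some embodiments", "exemplary embodiments", "example", "specific example", or "some examples" are intended to indicate that a particular feature, structure, material, or characteristic related to that embodiment or example is included in at least one embodiment or example of the present disclosure. Schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics may be included in any one or more embodiments or examples in any suitable manner.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may thus explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present disclosure, "multiple" means two or more unless otherwise stated.
"At least one of A, B, and C" has the same meaning as "at least one of A, B, or C"; both cover the following combinations of A, B, and C: A only, B only, C only, A and B, A and C, B and C, and A, B, and C together.
"A and/or B" covers three combinations: A only, B only, and A and B together.
The use of "adapted to" or "configured to" herein is open and inclusive language that does not exclude devices adapted to or configured to perform additional tasks or steps.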
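As used herein, the term "DNA" is short for deoxyribonucleic acid. DNA is the carrier of genetic information in living cells; its main role in the body is to direct the synthesis of RNA and proteins. DNA is a macromolecular polymer of deoxynucleotides, each consisting of a phosphate, a deoxyribose, and a base; there are four main bases: A (adenine), G (guanine), C (cytosine), and T (thymine).
As used herein, the term "RNA" is short for ribonucleic acid. RNA is the carrier of genetic information in living cells and in some viruses and viroids; its main role in the body is to direct the synthesis of proteins. RNA is a macromolecular polymer of ribonucleotides, each consisting of a phosphate, a ribose, and a base; there are four main bases: A (adenine), G (guanine), C (cytosine), and U (uracil).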
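At present, second-generation sequencing is widely used in reproductive genetics, tumor detection, and other fields, especially liquid biopsy. During library preparation, replication by the PCR (Polymerase Chain Reaction) amplification enzyme has a certain base error rate, and during sequencing the sequencer misreads bases at a rate of 0.01% to 0.1% (that is, about 1 to 10 erroneous bases per 10,000 bases). These noise mutations (also called non-native mutations) appear in samples with low-frequency or ultra-low-frequency mutations, making it hard to tell whether a mutation at a frequency of 1% or below is a real gene mutation or a noise mutation caused by sequencing or PCR errors. Because detecting these low-frequency mutations matters greatly, a UMI molecular tag (Unique Molecular Identifier), also called a molecular barcode, is introduced into the original DNA fragments. The principle is to attach a unique tag sequence to each original DNA fragment, which is then sequenced together with the fragment after library construction and PCR amplification. The different tag sequences make it possible to distinguish DNA templates of different origin (hereinafter DNA molecules) and to tell which mutations are false positives caused by random errors in PCR amplification and sequencing and which are truly carried by the patient, improving detection sensitivity and specificity. Specifically, the UMI molecular tag labels the original DNA fragments so that different DNA molecules carry different tags. When the sequencing results are analyzed, reads with the same insert (that is, the same original DNA fragment) are grouped. If the two ends of an insert carry UMI adapters with complementary pairing, the UMI adapters label the forward and reverse strands of that insert; a mutated base that appears at the same position in both strands is marked as a real mutation, which restores the original mutation state.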
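Currently there are two main strategies for introducing UMI molecular tags. The first is the single-end UMI molecular tag: for example, 8 random bases can be added at the P5 end of the adapter in place of the Index. Adapters made this way are simple and economical and are widely used. However, during library construction UMI adapters ligate at random, so one original DNA fragment receives two different UMI adapters and its forward and reverse strands carry different UMI labels. The original forward and reverse strand information therefore cannot be tracked and the two strand sequences cannot be corrected accurately; moreover, a base mutation in the UMI sequence makes the original DNA fragment appear to carry additional bases, introducing potential false-positive mutations. The second is the double-end UMI molecular tag: in the related art, a single-stranded adapter (sequence) is synthesized first, comprising a first sequence and a second sequence, where the second sequence comprises protective bases for a restriction endonuclease and a double-stranded molecular tag of random bases; the single-stranded adapter sequence is then annealed to form a double-stranded adapter; finally, enzymatic cleavage yields an adapter with a 3'-dT tail. Such a double-stranded adapter solves the problem that a single-end UMI cannot track the original forward and reverse strands, but a mutation in the UMI sequence itself still introduces false-positive mutations.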
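Some embodiments of the present disclosure provide an adapter 10. As shown in FIG. 1, the adapter 10 comprises a first strand 11, a second strand 12, and at least one UMI molecular tag 20, each UMI molecular tag 20 being located on the first strand 11 or the second strand 12.
Taking a Y-shaped adapter as an example, the adapter 10 can be classified, by whether it supports PCR-free libraries, into long adapters (complete Y-shaped adapters) and short adapters (incomplete Y-shaped adapters). A long adapter is ligated to both ends of the DNA fragment to be tested (the original DNA fragment described above) by TA ligation; when library yield is sufficient, sequencing can proceed directly without PCR amplification. A short adapter, after TA ligation to both ends of the fragment, must be PCR-amplified with Indexing Primers complementary to the short adapter to become a complete adapter before sequencing. Short and long adapters differ mainly in how the Index sequence is introduced. The Index sequence is configured to label different samples to be sequenced; one sample may contain thousands of DNA molecules, and the UMI molecular tag 20 labels different DNA molecules within the same sample or across different samples.
Taking a long adapter as an example, the adapter 10 can be a single-index adapter or a dual-index adapter: a single-index adapter has an Index sequence only at the P7 end, while a dual-index adapter has Index sequences at both the P5 and P7 ends.
Here, taking a single-index adapter as an example, when the adapter 10 includes one UMI molecular tag 20, as shown in FIG. 1, the UMI molecular tag 20 can be added at the P7 end in place of the Index sequence. In that case the first strand 11 may comprise, in order from the 5' end, a first PCR amplification primer 111 (commonly called P5) and a first sequencing primer sequence 112 (R1SP), and the second strand 12 may comprise, in order from the 5' end, a second sequencing primer sequence 121 (R2SP), the UMI molecular tag 20, and a second PCR amplification primer 122 (commonly called P7). That is, the adapter 10 is a single-end UMI adapter.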
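In some embodiments, the at least one UMI molecular tag 20 comprises at least one random base and at least one fixed base.
The number and arrangement of random bases and fixed bases within a UMI molecular tag 20 are not specifically limited.
In some embodiments, taking one random base and one fixed base as an example, the two can be arranged in sequence along the same direction: for example, the random base followed by the fixed base in the 5'-to-3' direction of the UMI sequence, or the random base followed by the fixed base in the 3'-to-5' direction of the UMI sequence.
In other embodiments, where at least one of the random base and the fixed base is plural, there are two possible cases.
In the first case, the random bases and/or the fixed bases are plural, and the plural random bases and/or plural fixed bases are arranged contiguously.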
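In this case there are several possible scenarios, depending on whether there is one or more of each kind of base. Below, the 5'-to-3' direction of the UMI sequence is called the first direction and the 3'-to-5' direction the second direction. In the first scenario there are multiple random bases and one fixed base: the random bases are contiguous, and the fixed base lies on one side of them, in either the first or the second direction. In the second scenario there are multiple fixed bases and one random base: the random base lies on one side of the fixed bases, in either the first or the second direction. In the third scenario both are plural: the random bases are contiguous, the fixed bases are contiguous, and the fixed bases lie on one side of the random bases, in either the first or the second direction.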
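In the second case, the random bases and/or the fixed bases are plural, and at least two of the random bases are spaced apart and/or at least two of the fixed bases are spaced apart.
Here too there are several scenarios, depending on whether there is one or more of each kind of base. In the first scenario there are multiple random bases and one fixed base, and the fixed base lies between any two adjacent random bases. In the second scenario there are multiple fixed bases and one random base, and the random base lies between any two adjacent fixed bases. In the third scenario both are plural, and several arrangements are possible. In the first arrangement, at least two random bases are spaced apart and the fixed bases lie between any two spaced random bases. In the second arrangement, at least two fixed bases are spaced apart and the random bases lie between any two spaced fixed bases. In the third arrangement, at least two random bases are spaced apart and at least two fixed bases are spaced apart; two spaced random bases may be separated by one or more fixed bases, and two spaced fixed bases may be separated by one or more random bases.
Taking multiple random bases with at least two of them spaced apart as an example, every two spaced random bases may be separated by 1 to 5 fixed bases.
Note that a random base is, as the name suggests, random: it can be any one of the 4 bases (A, T, C, and G) and is denoted N. Because random bases take different values, they can be used to label different DNA molecules.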
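For example, if a UMI molecular tag 20 has one random base, the N in the tag can be any of the 4 bases, which gives 4 kinds of UMI molecular tag 20. These 4 tags can form 4² (that is, 16) adapter combinations (one DNA molecule is ligated to two adapters), so 4² (16) different DNA molecules can be labeled and then detected.
If a UMI molecular tag 20 has 3 random bases, each N in the tag can be any of the 4 bases, so the 3 N have 4³ (that is, 64) combinations and give 4³ (64) kinds of UMI molecular tag 20. These 64 tags can form 64² (that is, 4096) adapter combinations (one DNA molecule is ligated to two adapters), so 64² (4096) different DNA molecules can be labeled and then detected.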
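The counting argument above can be checked with a short sketch. It assumes a hypothetical 7-base template, `NAGNCTN`, in which `N` marks a random position and every other letter is a fixed base; the template and the helper name `expand_umi` are illustrative and are not taken from the disclosure.

```python
from itertools import product

BASES = "ATCG"

def expand_umi(template: str) -> list[str]:
    """Expand every N in the template into A/T/C/G; other letters stay fixed."""
    slots = [BASES if ch == "N" else ch for ch in template]
    return ["".join(combo) for combo in product(*slots)]

# Hypothetical template: 3 random bases, each pair separated by 2 fixed bases.
tags = expand_umi("NAGNCTN")
print(len(tags))       # 64, i.e. 4**3 distinct tags
print(len(tags) ** 2)  # 4096 ordered tag pairs when a fragment carries two adapters
```

Adding a fourth random position to the template would multiply the tag count by 4, which is the trade-off between labeling capacity and tag length discussed in the text.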
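Thus, the more random bases there are, the more kinds of UMI molecular tags exist and the more DNA molecules they can label.
Fixed bases are chosen from fixed, known bases and are used to correct errors that arise during amplification or sequencing in the sequence to be tested and in the UMI molecular tag itself, reducing the introduction of false-positive mutations.
Specifically, as shown in FIG. 2, suppose there are 100 original DNA fragments with the same start and end positions (that is, the same sequence), denoted original sequence 1, original sequence 2, original sequence 3, ..., original sequence 99, and original sequence 100, where original sequence 2 carries a mutation, so the true mutation frequency is 1%. Each original DNA fragment is ligated to a different UMI adapter; the resulting sequences are still denoted original sequence 1 through original sequence 100. These 100 UMI-tagged sequences are enriched by PCR amplification to obtain a DNA library. The library contains 100 copies of UMI-tagged original sequence 1 (to distinguish them, the other 99 copies produced by replication are denoted original sequence 1'). In the first case in FIG. 2, the AAGCT on the UMI adapter shows that the 99 UMI-tagged copies of original sequence 1' were produced from original sequence 1 by PCR amplification. Because the detection site in original sequence 1 is an A base, the 99 copies should in theory also carry A; if the 100th copy carries C instead, that can be judged a noise mutation caused by a PCR amplification error or a sequencing error. When both the DNA sequence and the UMI adapter of the 100th copy contain amplification errors, a molecular tag made up only of random bases would lead to both the DNA sequence and the UMI adapter being judged as real, introducing a false-positive mutation. In the embodiments of the present disclosure, by contrast, as in the second case in FIG. 2, the middle base of the 5-base UMI molecular tag 20 is fixed as G. Since the tag should read AAGCT, a copy whose tag reads AATCT shows that the UMI molecular tag 20 of that 100th copy also carries a noise mutation introduced by PCR amplification or sequencing; and because the DNA sequences in the other 99 copies labeled by this 5-base UMI molecular tag 20 are not mutated, the mutation in the DNA sequence of the 100th copy can likewise be judged a noise mutation introduced by PCR amplification or sequencing.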
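A minimal sketch of the fixed-base check in the example above, assuming the 5-base tag follows a hypothetical template `NNGNN` (middle base fixed as G, as in the AAGCT example). The function name `fixed_mismatches` and the template string are illustrative assumptions, not part of the disclosure.

```python
def fixed_mismatches(observed: str, template: str) -> list[int]:
    """Return positions where a fixed (non-N) template base disagrees with the read."""
    if len(observed) != len(template):
        raise ValueError("observed tag and template must have the same length")
    return [
        i
        for i, (obs, tpl) in enumerate(zip(observed, template))
        if tpl != "N" and obs != tpl
    ]

TEMPLATE = "NNGNN"  # assumed layout for the 5-base example: middle base fixed as G

print(fixed_mismatches("AAGCT", TEMPLATE))  # [] -> fixed base intact
print(fixed_mismatches("AATCT", TEMPLATE))  # [2] -> fixed base altered, tag carries an error
```

A read whose tag returns a non-empty list has a damaged tag, so a variant seen only in that read is a candidate noise mutation rather than evidence of a new molecule.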
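Thus, a UMI molecular tag 20 with some fixed bases both preserves adapter diversity for labeling different original DNA fragments and rules out, to some extent, noise mutations introduced by PCR amplification or sequencing, which improves detection accuracy.
In some embodiments of the present disclosure, there are at least three random bases, all spaced apart from one another; every two spaced random bases are separated by one group of fixed bases, and every two groups of fixed bases contain the same number of fixed bases.
In these embodiments, limiting the number of random bases to at least three ensures that the UMI molecular tags 20 can label at least 4096 different DNA molecules, increasing the number of molecules that can be detected and thus the detection accuracy for the sample. In addition, separating every two spaced random bases by one group of fixed bases, with every group containing the same number of fixed bases, makes the arrangement of random and fixed bases more regular, so it is easier to tell whether a mutation lies in a fixed base or in the original DNA fragment itself, which reduces false-positive mutations and improves detection accuracy. Testing also showed that, for the same number of bases in the UMI molecular tag 20, spacing the random bases apart from one another gives higher detection accuracy than arranging them contiguously.
In other embodiments, because fixed bases serve to rule out noise mutations introduced by PCR amplification or sequencing, the more fixed bases a UMI molecular tag 20 contains, the better the fault tolerance during detection and the higher the detection accuracy. For example, with a UMI molecular tag 20 of 3 random bases and 4 fixed bases, 4 fixed bases are available for fault tolerance in subsequent sequencing, and the fault tolerance rate can be taken as 4 / 7 × 100%, about 57%.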
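The fault tolerance rate quoted above is simply the share of fixed positions in the tag. The sketch below applies the same ratio to tags with 3 random bases separated by 2, 3, or 4 fixed bases; the helper name `fault_tolerance` is illustrative, and extending the 4/7 figure to the longer tags is an extrapolation from the text's definition rather than a result reported in the disclosure.

```python
def fault_tolerance(n_random: int, n_fixed: int) -> float:
    """Fraction of tag positions that are fixed, as defined in the text (fixed / total)."""
    return n_fixed / (n_random + n_fixed)

for gap in (2, 3, 4):               # fixed bases between neighbouring random bases
    n_random, n_fixed = 3, 2 * gap  # 3 random bases leave two gaps to fill
    total = n_random + n_fixed
    print(total, f"{fault_tolerance(n_random, n_fixed):.0%}")
# prints: 7 57%, then 9 67%, then 11 73%
```

The tag lengths that fall out of this layout (7, 9, and 11 bases) sit inside the 7 to 11 base range given elsewhere in the description.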
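However, more fixed bases consume more sequencing data downstream, so more fixed bases are not always better.
On this basis, in some embodiments, every two spaced random bases are separated by 2 to 4 fixed bases.
In these embodiments, separating every two random bases by 2 to 4 fixed bases maintains the fault tolerance rate of detection while preventing an excess of fixed bases from consuming sequencing data.
The 2 to 4 fixed bases may be the same or different; this is not specifically limited here.
In some embodiments, the 2 to 4 fixed bases all differ from one another.
In these embodiments, because the 2 to 4 fixed bases all differ, the same fluorescence signal (each signal labels one kind of base) is not concentrated during sequencing (identical bases tend to concentrate the same fluorescence), which avoids inaccurate reads caused by concentrated fluorescence and improves detection accuracy.
In other embodiments, in every two adjacent groups of fixed bases, at least one fixed base in one group differs from a fixed base in the other group.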
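For example, take a UMI molecular tag 20 with 3 random bases spaced apart from one another, every two random bases separated by 1 fixed base (that is, each of the two adjacent groups of fixed bases contains 1 fixed base). Because at least one fixed base in one group differs from a fixed base in the other group, the sequence of the UMI molecular tag can be written as:
NN1NN2N
where N1 and N2 are different and each is any one of A, T, C, and G; the 3 N may be the same or different and are each independently any one of A, T, C, and G.
That is, taking N1 as A and N2 as C, the two adjacent groups of fixed bases are A and C, and these two fixed bases differ.
Compared with choosing the same base for N1 and N2, this prevents the same fluorescence signal (each signal labels one kind of base) from being concentrated during sequencing (identical bases tend to concentrate the same fluorescence), which avoids inaccurate reads caused by concentrated fluorescence and improves detection accuracy.
As a further example, take a UMI molecular tag 20 with 3 random bases spaced apart from one another, every two random bases separated by 2 fixed bases (that is, each of the two adjacent groups of fixed bases contains 2 fixed bases). Because at least one fixed base in one group differs from a fixed base in the other group, the sequence of the UMI molecular tag can be written as:
NN3N4NN5N6N
where N3 and N4 may be the same or different and are each independently any one of A, T, C, and G; N5 and N6 may be the same or different and are each independently any one of A, T, C, and G; at least one of N3 and N4 differs from both N5 and N6; and the 3 N may be the same or different and are each independently any one of A, T, C, and G.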
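Depending on whether N3 and N4 are the same, there are two cases. In the first case, N3 and N4 are the same, and there are two scenarios depending on whether N5 and N6 are the same.
In the first scenario, N5 and N6 differ. Here "at least one fixed base among N5 and N6 differs from a fixed base among N3 and N4" means either that one fixed base among N5 and N6 differs from both fixed bases among N3 and N4, or that both fixed bases among N5 and N6 differ from both fixed bases among N3 and N4. For the former, taking N3 and N4 both as A, N5 and N6 can be A and C, A and G, or A and T. With A and C, the adjacent groups are AA and AC, and one fixed base of one group (C) differs from the two fixed bases of the other group (A); with A and G, the adjacent groups are AA and AG, and one fixed base of one group (G) differs from the two fixed bases of the other group (A); with A and T, the adjacent groups are AA and AT, and one fixed base of one group (T) differs from the two fixed bases of the other group (A). For the latter, again taking N3 and N4 both as A, N5 and N6 can be C and G, C and T, or G and T. With C and G, the adjacent groups are AA and CG, and the two fixed bases of one group (C and G) both differ from the two fixed bases of the other group (A); with C and T, the adjacent groups are AA and CT, and the two fixed bases of one group (C and T) both differ from the two fixed bases of the other group (A); with G and T, the adjacent groups are AA and GT, and the two fixed bases of one group (G and T) likewise both differ from the two fixed bases of the other group (A).
In the second scenario, N5 and N6 are the same. Here the phrase means that both fixed bases among N5 and N6 differ from both fixed bases among N3 and N4. For example, again taking N3 and N4 both as A, N5 and N6 can both be T, G, or C: when both are T, the two fixed bases among N5 and N6 (T) differ from the two fixed bases among N3 and N4 (A); when both are G, the two fixed bases among N5 and N6 (G) differ from the two fixed bases among N3 and N4 (A); when both are C, the two fixed bases among N5 and N6 (C) differ from the two fixed bases among N3 and N4 (A).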
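In the second case, N3 and N4 differ, and again there are two scenarios depending on whether N5 and N6 are the same. In the first scenario, N5 and N6 differ. Here "at least one fixed base among N5 and N6 differs from a fixed base among N3 and N4" means either that one fixed base among N5 and N6 differs from one fixed base among N3 and N4, or that both fixed bases among N5 and N6 differ from both fixed bases among N3 and N4. For the former, taking N3 and N4 as A and T, N5 and N6 can be A and C, A and G, T and C, T and G, and so on: with A and C, one fixed base among N5 and N6 (C) differs from one fixed base among N3 and N4 (T); with A and G, one fixed base among N5 and N6 (G) differs from one fixed base among N3 and N4 (T); with T and C, one fixed base among N5 and N6 (C) differs from one fixed base among N3 and N4 (T); with T and G, one fixed base among N5 and N6 (G) differs from one fixed base among N3 and N4 (T). For the latter, again taking N3 and N4 as A and T, N5 and N6 can be G and C, in which case both fixed bases among N5 and N6 (G and C) differ from both fixed bases among N3 and N4 (A and T). In the second scenario, N5 and N6 are the same. Here the phrase means that the two fixed bases among N5 and N6 differ from one or both fixed bases among N3 and N4. For example, again taking N3 and N4 as A and T, N5 and N6 can both be A, T, C, or G: when both are A, or both are T, the two fixed bases among N5 and N6 differ from one fixed base among N3 and N4; when both are C, or both are G, the two fixed bases among N5 and N6 differ from both fixed bases among N3 and N4.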
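The case analysis above can be condensed into a small enumeration. The sketch below adopts one possible reading of the rule, namely that two adjacent groups of fixed bases are acceptable when either group contains a base that is absent from the other; this reading is an assumption chosen because it agrees with every worked case in the text (for example AA next to AC, or AT next to AA), and the helper names are illustrative.

```python
from itertools import product

BASES = "ATCG"

def groups_differ(group_a: str, group_b: str) -> bool:
    """True if either group contains a base that is absent from the other group.

    Assumed reading of the rule; it matches the worked cases in the text.
    """
    return set(group_a) != set(group_b)

def allowed_neighbours(group: str) -> list[str]:
    """All fixed-base groups of the same length that may sit next to `group`."""
    candidates = ("".join(p) for p in product(BASES, repeat=len(group)))
    return [c for c in candidates if groups_differ(group, c)]

print(len(allowed_neighbours("AA")))  # 15: every 2-base group except AA itself
print(len(allowed_neighbours("AT")))  # 14: every 2-base group except AT and TA
```

Under this reading the only excluded neighbours are groups built from exactly the same set of bases, which is what would concentrate the same fluorescence signals in both groups.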
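In these embodiments, as in the case above where every two random bases are separated by 1 fixed base (that is, each adjacent group contains 1 fixed base), the same fluorescence signal (each signal labels one kind of base) is not concentrated during sequencing (identical bases tend to concentrate the same fluorescence), which avoids inaccurate reads caused by concentrated fluorescence and improves detection accuracy.
In some embodiments, there are 3 random bases.
In these embodiments, limiting the random bases to 3 allows 4096 different DNA molecules to be labeled, which meets application needs.
In some embodiments, the UMI molecular tag 20 comprises 7 to 11 bases.
In these embodiments, limiting the number of bases in the UMI molecular tag 20 to 7 to 11 avoids a tag so long that it consumes sequencing data, and also avoids a tag so short that fault tolerance suffers (for example, too few fixed bases) and/or too few DNA molecules can be labeled (for example, too few random bases).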
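In some embodiments, as shown in FIG. 3, there are two UMI molecular tags 20, located on the first strand 11 and the second strand 12 respectively and bound by at least partial complementary base pairing.
In these embodiments, the two UMI molecular tags 20 can be a first UMI molecular tag and a second UMI molecular tag, and there are two possible cases. In the first case, the first UMI molecular tag 20 can be located between the first PCR amplification primer 111 and the first sequencing primer sequence 112, and the second UMI molecular tag 20 between the second sequencing primer sequence 121 and the second PCR amplification primer 122, the two being bound by partial complementary base pairing; the adapter 10 then behaves like a single-end UMI adapter and likewise cannot track the forward and reverse strands. In the second case, as shown in FIG. 3, the first strand 11 is the forward strand (in FIG. 3, the strand running 5' to 3' from left to right) and the second strand 12 is the reverse strand (in FIG. 3, the strand running 3' to 5' from left to right); the UMI molecular tag 20 on the first strand 11 (the first UMI molecular tag) is located downstream of the first sequencing primer sequence 112, and the UMI molecular tag 20 on the second strand 12 (the second UMI molecular tag) is located upstream of the second sequencing primer sequence 121. The first and second UMI molecular tags are then bound by complementary pairing of all their bases. In this case the adapter 10 can also be called a double-end UMI adapter; compared with a single-end UMI adapter, it can track the forward and reverse strands of the sequence to be tested at the same time, so that a mutated base appearing at the same position in both strands is marked as a real mutation, which further improves detection accuracy.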
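The both-strands rule for a double-end UMI adapter can be sketched as a small decision function. This is a minimal illustration under stated assumptions: each strand family has already been collapsed to one consensus base per position, both bases are reported in the same orientation as the reference, and the function name `call_site` and its labels are hypothetical.

```python
def call_site(forward_base: str, reverse_base: str, reference_base: str) -> str:
    """Classify one position from the consensus base of each strand family.

    Both bases are assumed to be reported in the same orientation as the reference.
    """
    forward_mut = forward_base != reference_base
    reverse_mut = reverse_base != reference_base
    if forward_mut and reverse_mut and forward_base == reverse_base:
        return "true mutation"  # same change seen on both strands
    if forward_mut or reverse_mut:
        return "noise"          # change seen on one strand only, or strands disagree
    return "reference"

print(call_site("C", "C", "A"))  # true mutation
print(call_site("C", "A", "A"))  # noise
print(call_site("A", "A", "A"))  # reference
```

A single-end UMI adapter cannot supply the two strand families this function needs, which is why the text says it cannot track the forward and reverse strands.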
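In other embodiments, as shown in FIG. 3, the adapter 10 further comprises Index sequence 1 and Index sequence 2; Index sequence 1 is located on the second strand 12, Index sequence 2 is located on the first strand 11, and Index sequence 1 and Index sequence 2 can label different samples.
Some embodiments of the present disclosure provide an adapter ligation reagent comprising multiple kinds of adapters 10, a ligase, a buffer, and so on. The adapters 10 are adapters 10 as described above. The ligase can be, for example, a DNA ligase or an RNA ligase, whose role is to promote ligation of the adapters 10 to the end-repaired DNA fragments; the buffer provides a stable pH environment for the ligation reaction. Among the multiple kinds of adapters 10, every two kinds differ in at least one random base of at least one UMI molecular tag 20 they contain.
In these embodiments, the multiple kinds of adapters 10 are all UMI adapters, and at least one UMI molecular tag 20 contained in each UMI adapter comprises at least one random base and at least one fixed base. The random bases take different values, so different UMI adapters can label different DNA molecules and multiple different DNA molecules can be sequenced. The fixed bases are fixed, known bases, so errors that arise during amplification or sequencing in the sequence to be tested and in the UMI molecular tag itself can be corrected, which reduces the introduction of false-positive mutations.
Some embodiments of the present disclosure provide a kit comprising the adapter ligation reagent described above.
That is, the kit can be an adapter ligation kit. A kit is a box holding chemical reagents used to detect chemical components, drug residues, virus types, and the like; here it is a box holding the adapter ligation reagent.
The beneficial technical effects of the kit provided by the embodiments of the present disclosure are the same as those of the adapter provided by the embodiments of the present disclosure and are not repeated here.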
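Some embodiments of the present disclosure provide use of a UMI molecular tag 20 in gene sequencing, the UMI molecular tag 20 comprising at least one random base and at least one fixed base.
In some embodiments, the gene may comprise a DNA molecule or an RNA molecule for the expression of genetic information, and the UMI molecular tag 20 is configured to label different DNA molecules or RNA molecules.
For example, the gene may comprise cfDNA, and the UMI molecular tag 20 may be used in a UMI adapter to label different cfDNA molecules.
Some embodiments of the present disclosure provide a DNA or RNA library construction method, comprising:
Obtaining fragmented DNA.
The fragmented DNA can be obtained by mechanical shearing or by enzymatic digestion.
Of course, before the fragmented DNA is obtained, cDNA (complementary DNA) can be obtained by reverse transcription of mRNA, and the fragmented DNA is obtained after the cDNA is sheared.
In some embodiments, some DNA is free DNA in blood, which is already fragmented and can be obtained directly from blood or through commercial channels, such as cfDNA (Circulating Free DNA), a DNA that exists outside cells in a free, cell-free state.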
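Performing end repair and A-tailing on the fragmented DNA or RNA to obtain an end repair product.
For example, when the fragmented DNA is cfDNA, the KAPA Biosystem (also referred to as KAPA) kit can be used to perform end repair and A-tailing on the cfDNA.
Treating the end repair product with the adapter ligation reagent described above, so that the adapters in the adapter ligation reagent react with the end repair product to give an adapter ligation product.
That is, the adapter ligation reagent containing multiple kinds of adapters is used to ligate adapters to the end repair product. Each end repair product can include a forward strand and a reverse strand, and one end repair product can be ligated to two adapters 10. When each adapter 10 includes one UMI molecular tag 20, the adapter 10 is a single-end UMI adapter, which can label different end repair products but cannot track their forward and reverse strands. When the adapter 10 is a double-end UMI adapter, the forward and reverse strands of the end repair product can be tracked, so that a mutated base appearing at the same position in both strands is marked as a real mutation, which further improves detection accuracy.
Enriching the adapter ligation product to obtain a DNA or RNA library.
For example, the adapter ligation product can be enriched by PCR amplification.
Because the UMI molecular tag 20 in the adapter 10 comprises at least one random base and at least one fixed base, and the random bases take different values, the UMI molecular tag 20 can label different DNA molecules according to the differing random bases. The fixed bases are fixed, known bases, so errors that arise during amplification or sequencing in the sequence to be tested and in the UMI molecular tag 20 itself can be corrected, which reduces the introduction of false-positive mutations and improves detection accuracy.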
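Some embodiments of the present disclosure provide a gene sequencing detection method, comprising:
Sequencing DNA or RNA using the DNA or RNA library obtained by the DNA or RNA library construction method described above.
In the embodiments of the present disclosure, DNA or RNA is sequenced using the DNA or RNA library obtained by the library construction method described above. Because every DNA or RNA molecule in that library is ligated to an adapter 10, and the adapter 10 contains a UMI molecular tag 20, the DNA or RNA molecules can be labeled by the UMI molecular tag 20, and the fixed bases can be used during subsequent sequencing to correct errors produced during sequencing or amplification, which reduces the introduction of false-positive mutations and improves detection accuracy.
Some embodiments of the present disclosure provide a kit comprising the DNA or RNA library obtained by the DNA or RNA library construction method described above.
Of course, in some embodiments, the kit can also include a targeted capture kit, which can include a targeted capture reagent. The targeted capture reagent can perform targeted capture by hybridization or by multiplex PCR (which can take place before enrichment during library construction); either route allows some selected genes to be sequenced.
Some embodiments of the present disclosure provide a UMI molecular tag set. As shown in FIG. 4, it comprises two UMI molecular tags 20 bound to each other by at least partial complementary base pairing, wherein at least one UMI molecular tag 20 comprises at least one random base N and at least one fixed base.
That is, the two UMI molecular tags 20 can be located on the first strand 11 and the second strand 12 of the adapter 10 described above; for details, see the description above of the adapter 10 comprising two UMI molecular tags 20, which is not repeated here.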
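Some embodiments of the present disclosure provide a method for preparing an adapter 10, the adapter 10 comprising at least one UMI molecular tag 20. As shown in FIG. 5, the preparation method comprises:
S1) Synthesizing a first strand 11 and a second strand 12, each UMI molecular tag 20 being located on the first strand 11 or the second strand 12, and at least one UMI molecular tag 20 comprising at least one random base and at least one fixed base.
For example, the first strand 11 and the second strand 12 can each be synthesized by chemical synthesis (that is, DNA synthesis) rather than by biological synthesis.
Of course, once the UMI molecular tag set described above has been obtained, one strand (for example the first strand 11) and the part of the other strand (for example the second strand 12) that does not pair with the first strand 11 can be synthesized on the basis of the tag set, and the part of the second strand 12 that pairs with the first strand 11 can then be synthesized by complementary base pairing.
S2) Annealing the first strand 11 and the second strand 12 to obtain the adapter 10.
That is, once the two single strands, the first strand 11 and the second strand 12, have been synthesized as above, specific annealing can be used to bind the first strand 11 and the second strand 12 through partial complementary base pairing.
To evaluate the technical effects of the embodiments of the present disclosure objectively, the present disclosure is described in detail, by way of example, through the following Examples and Experimental Examples.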
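1. Adapter synthesis:
Example 1
Step 1) Synthesize the first strand 11 (the UMI molecular tag 20 in the first strand 11 is located downstream of the first sequencing primer sequence 112 and includes 3 random bases N, with 2 fixed bases between every two random bases N, and the end carries a thio-modified T base) and the second strand 12 (the UMI molecular tag 20 in the second strand 12 is located upstream of the second sequencing primer sequence 121 and includes 3 random bases N, with 2 fixed bases between every two random bases N, and the end is linked to a phosphate group), 64 of each.
The sequence of the first strand 11 is shown in SEQ ID NO: 1 in the sequence listing, and the sequence of the second strand 12 is shown in SEQ ID NO: 2 in the sequence listing.
The first strand 11 and the second strand 12 can also be as shown in Table 1 below:
Table 1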
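First strand 11: 5'-aatgatacggcgaccaccgagatgtnnnnnnnnacactctttccctacacgacgctcttccgatcnagcntagn-s-t-3'
Second strand 12: 3'-g-s-ttcgtcttctgccgtatgctctannnnnnnncactgacctcaagtctgcacacgagaaggctagntcngan-p’-5'
When N in the first strand 11 is selected from the 4 different bases, there are 64 UMI molecular tag 20 sequences for each of the first strand 11 and the second strand 12; the 64 sequences are shown in Table 2 below:
Table 2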
Figure PCTCN2021134159-appb-000001
Figure PCTCN2021134159-appb-000002
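Step 2) Resuspend each of the paired first strand 11 and second strand 12 to 100 uM in 100 uL of buffer reagent; the buffer reagent contains 10 mM Tris (bringing the buffer reagent to pH 7.5), 2 mM EDTA, and 50 mM NaCl.
Step 3) Take 10 uL of the first strand 11, 10 uL of the second strand 12, and 80 uL of buffer reagent in a PCR tube, mix thoroughly, and centrifuge briefly.
Step 4) Place the PCR tube in a PCR machine and set the program temperature to 95°C and the reaction time to 10 minutes. After the reaction, turn off the PCR machine, wait for the temperature to fall to room temperature (about 2 h of cooling; room temperature is about 25°C), and remove the PCR tube.
Step 5) Take a 1 uL sample for quality inspection on an automatic nucleic acid and protein analyzer (Qsep100). The results are shown in FIG. 6, where the peak at 70 bp to 80 bp is the double-stranded adapter, LM is the Low Marker with a length of 20 bp, and UM is the Upper Marker with a length of 1000 bp. LM and UM serve as references for locating the double-stranded adapter, and the synthesis efficiency of the adapter can reach about 40%.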
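The volumes in steps 2) and 3) fix the strand concentration in the annealing mix. The sketch below works it out from the standard dilution relation C1 × V1 = C2 × V2; the helper name `diluted_concentration` is illustrative, and the calculation is an added arithmetic check rather than a figure stated in the disclosure.

```python
def diluted_concentration(stock_um: float, stock_ul: float, total_ul: float) -> float:
    """Concentration after dilution, from C1 * V1 = C2 * V2 (uM and uL throughout)."""
    return stock_um * stock_ul / total_ul

total_ul = 10 + 10 + 80  # first strand + second strand + buffer, from step 3)
per_strand = diluted_concentration(stock_um=100, stock_ul=10, total_ul=total_ul)
print(per_strand)  # 10.0 uM of each strand in the annealing mix
```

Because both strands end up at the same concentration, the mix is equimolar, which is the usual starting point for annealing two complementary oligonucleotides.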
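Example 2
The steps of Example 2 are essentially the same as those of Example 1 and are not repeated here. The difference is that, in Step 1), some of the fixed bases in the UMI molecular tags of the first strand 11 and the second strand 12 are different.
In Example 2, the sequence of the first strand 11 is shown as SEQ ID NO: 131 in the sequence listing, and the sequence of the second strand 12 is shown as SEQ ID NO: 132 in the sequence listing.
The first strand 11 and the second strand 12 may also be as shown in Table 3 below:
Table 3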
First strand 11 5'-aatgatacggcgaccaccgagatctnnnnnnnnacactctttccctacacgacgctcttccgatcnagcntagn-s-t-3'
Second strand 12 3'-g-s-ttcgtcttctgccgtatgctctannnnnnnncactgacctcaagtctgcacacgagaaggctagntcgnatcn-p’-5'
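Where each N in the first strand 11 is selected from 4 different bases, there are 64 sequences of the UMI molecular tag 20 for each of the first strand 11 and the second strand 12. The 64 UMI molecular tag 20 sequences are shown in Table 4 below:
Table 4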
Figure PCTCN2021134159-appb-000003
Figure PCTCN2021134159-appb-000004
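Example 3
The steps of Example 3 are essentially the same as those of Example 1 and are not repeated here. The difference is that, in Step 1), some of the fixed bases in the UMI molecular tags of the first strand 11 and the second strand 12 are different.
In Example 3, the sequence of the first strand 11 is shown as SEQ ID NO: 261 in the sequence listing, and the sequence of the second strand 12 is shown as SEQ ID NO: 262 in the sequence listing.
The first strand 11 and the second strand 12 may also be as shown in Table 5 below:
Table 5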
First strand 11 5'-aatgatacggcgaccaccgagatctnnnnnnnnacactctttccctacacgacgctcttccgatcnagctnagctn-s-t-3'
Second strand 12 3'-g-s-ttcgtcttctgccgtatgctctannnnnnnncactgacctcaagtctgcacacgagaaggctagntcgantcgan-p’-5'
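Where each N in the first strand 11 is selected from 4 different bases, there are 64 sequences of the UMI molecular tag 20 for each of the first strand 11 and the second strand 12. The 64 UMI molecular tag 20 sequences are shown in Table 6 below:
Table 6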
Figure PCTCN2021134159-appb-000005
Figure PCTCN2021134159-appb-000006
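2. Library construction and sequencing:
Experimental Example 1
Step 1) Use a custom cfDNA reference standard with multiple mutation sites from the company 菁良基因 as the sample, with mutation frequencies of 1% and 0.1%. Because the standard is a cfDNA sample, no fragmentation is needed and library construction can proceed directly.
Step 2) Perform end repair and A-tailing on the cfDNA using a KAPA kit.
Step 3) Using a KAPA kit and the adapters synthesized in Example 1, ligate adapters to the cfDNA to obtain adapter ligation products.
Step 4) Amplify, enrich, and purify the adapter ligation products to obtain a cfDNA library.
Step 5) Using the complete IDT (Integrated DNA Technologies) kit set, perform target capture on the adapter ligation products to obtain adapter ligation products of the selected genes.
Step 6) Using the cfDNA library obtained in Step 4) as the sample, sequence it on a NovaSeq 6000 (Illumina) instrument according to the instrument's standard operating procedure.
Step 7) Analyze the basic quality control of the raw sequencing data with FastQC. The detected sites and mutations are essentially consistent with the theoretical values; the detailed results are shown in Tables 7 and 8 below.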
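Experimental Example 2
The steps of Experimental Example 2 are essentially the same as those of Experimental Example 1 and are not repeated here. The difference is that, in Step 3), the adapters synthesized in Example 2 are used for library construction. The detected sites and mutations are likewise essentially consistent with the theoretical values; the detailed results are shown in Tables 7 and 8 below.
Experimental Example 3
The steps of Experimental Example 3 are essentially the same as those of Experimental Example 1 and are not repeated here. The difference is that, in Step 3), the adapters synthesized in Example 3 are used for library construction. The detected sites and mutations are likewise essentially consistent with the theoretical values; the detailed results are shown in Tables 7 and 8 below.
Table 7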
Figure PCTCN2021134159-appb-000007
Table 8
Figure PCTCN2021134159-appb-000008
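In Table 7, the mutation frequencies actually detected in Experimental Example 1 at the different mutation sites of the selected genes fall essentially between 0.089% and 0.12%, which is reasonably accurate relative to the theoretical mutation frequency (0.1%). Those detected in Experimental Example 2 fall essentially between 0.081% and 0.150%, and those detected in Experimental Example 3 fall essentially between 0.079% and 0.140%; both are likewise reasonably accurate relative to the theoretical mutation frequency.
In Table 8, the mutation frequencies actually detected in Experimental Example 1 at the different mutation sites of the selected genes fall essentially between 0.80% and 1.20%, which is reasonably accurate relative to the theoretical mutation frequency (1%). Those detected in Experimental Example 2 fall essentially between 0.85% and 1.30%, and those detected in Experimental Example 3 fall essentially between 0.78% and 1.25%; both are likewise reasonably accurate relative to the theoretical mutation frequency.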
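To put these ranges on a common scale, the deviation of each reported bound from the theoretical frequency can be expressed as a percentage of the theoretical value. The Python sketch below uses only the range endpoints quoted in the two paragraphs above; the dictionary layout and function names are hypothetical, and per-site values are not reproduced because Tables 7 and 8 are available only as images.

```python
# (experimental example, theoretical frequency in %) -> (lowest, highest detected frequency in %)
REPORTED_RANGES = {
    ("Experimental Example 1", 0.1): (0.089, 0.12),
    ("Experimental Example 2", 0.1): (0.081, 0.150),
    ("Experimental Example 3", 0.1): (0.079, 0.140),
    ("Experimental Example 1", 1.0): (0.80, 1.20),
    ("Experimental Example 2", 1.0): (0.85, 1.30),
    ("Experimental Example 3", 1.0): (0.78, 1.25),
}


def relative_deviation_percent(observed: float, expected: float) -> float:
    """Signed deviation of an observed frequency from the expected one, in percent."""
    return (observed - expected) / expected * 100.0


def deviation_table(ranges=REPORTED_RANGES):
    """Map each (example, theoretical frequency) to the deviations of its two bounds."""
    return {
        key: tuple(round(relative_deviation_percent(v, key[1]), 1) for v in bounds)
        for key, bounds in ranges.items()
    }


for key, deviations in deviation_table().items():
    print(key, deviations)
```

For Experimental Example 1 at the 0.1% level, for instance, the bounds 0.089% and 0.12% correspond to deviations of about -11% and +20% of the theoretical value.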
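In summary, a UMI molecular tag with partially fixed bases both preserves the diversity of the adapters, so that different original DNA fragments can be labeled, and makes it possible to exclude, to some extent, noise mutations introduced by PCR amplification or sequencing, thereby improving detection accuracy.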
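One common way in which UMI labels are used to suppress PCR or sequencing noise is to group reads that carry the same UMI and take a per-position majority vote. The disclosure does not spell out a consensus algorithm, so the Python sketch below is a generic illustration only; the function name, the `(umi, sequence)` input layout, and the minimum family size are all assumptions.

```python
from collections import Counter, defaultdict


def consensus_by_umi(reads, min_family_size=2):
    """Collapse reads sharing a UMI into one majority-vote consensus sequence.

    `reads` is an iterable of (umi, sequence) pairs. Families smaller than
    `min_family_size`, or containing reads of unequal length, are skipped.
    """
    families = defaultdict(list)
    for umi, sequence in reads:
        families[umi].append(sequence)

    consensus = {}
    for umi, sequences in families.items():
        if len(sequences) < min_family_size:
            continue
        if len({len(s) for s in sequences}) != 1:
            continue
        # zip(*sequences) yields one column of bases per read position.
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus
```

A base change that appears in only a minority of the reads of one UMI family is treated as noise, whereas a true variant in the original fragment appears in all or most reads of that family.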
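The above are merely specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.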

Claims (19)

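  1. A UMI molecular tag, comprising:
    at least one random base and at least one fixed base.
  2. The UMI molecular tag according to claim 1, wherein
    at least one of the random base and the fixed base is plural in number;
    the plurality of random bases and/or the plurality of fixed bases are arranged consecutively;
    or,
    at least two of the plurality of random bases are arranged at intervals, and/or at least two of the plurality of fixed bases are arranged at intervals.
  3. The UMI molecular tag according to claim 2, wherein
    there are a plurality of random bases, at least two of the plurality of random bases are arranged at intervals, and every two random bases arranged at intervals are separated by 1 to 5 fixed bases.
  4. The UMI molecular tag according to claim 3, wherein
    there are at least three random bases;
    the at least three random bases are all arranged at intervals from one another, every two random bases arranged at intervals are separated by one group of fixed bases, and every two groups of fixed bases contain the same number of fixed bases.
  5. The UMI molecular tag according to claim 4, wherein
    in every two adjacent groups of fixed bases, at least one fixed base in one group differs from one fixed base in the other group.
  6. The UMI molecular tag according to any one of claims 3 to 5, wherein
    every two random bases arranged at intervals are separated by 2 to 4 fixed bases, and the 2 to 4 fixed bases are all different from one another.
  7. The UMI molecular tag according to any one of claims 3 to 6, wherein
    there are 3 random bases.
  8. The UMI molecular tag according to any one of claims 1 to 7, wherein
    the UMI molecular tag comprises 7 to 11 bases.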
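  9. A molecular tag group, comprising:
    two UMI molecular tags, the two UMI molecular tags being bound to each other through complementary pairing of at least some bases;
    wherein at least one UMI molecular tag is the UMI molecular tag according to any one of claims 1 to 8.
  10. An adapter, comprising:
    a first strand and a second strand; and
    at least one UMI molecular tag, each UMI molecular tag being located on the first strand or the second strand, the at least one UMI molecular tag being the UMI molecular tag according to any one of claims 1 to 8.
  11. The adapter according to claim 10, wherein
    there are two UMI molecular tags, the two UMI molecular tags being located on the first strand and the second strand respectively and bound to each other through complementary pairing of at least some bases.
  12. The adapter according to claim 11, wherein
    the first strand is a forward strand and the second strand is a reverse strand;
    the first strand comprises a first sequencing primer sequence, the second strand comprises a second sequencing primer sequence, the UMI molecular tag on the first strand is located downstream of the first sequencing primer sequence, and the UMI molecular tag on the second strand is located upstream of the second sequencing primer sequence.
  13. An adapter ligation reagent, comprising:
    a plurality of adapters, the plurality of adapters being adapters according to any one of claims 10 to 12;
    wherein, among the plurality of adapters, every two adapters differ in at least one random base of at least one UMI molecular tag that they contain.
  14. A kit, comprising:
    the adapter ligation reagent according to claim 13.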
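  15. Use of the UMI molecular tag according to any one of claims 1 to 8 in gene sequencing.
  16. The use of the UMI molecular tag in gene sequencing according to claim 15, wherein
    the gene comprises a DNA molecule used for the expression of genetic information;
    the UMI molecular tag is configured to label different DNA molecules.
  17. A DNA library construction method, comprising:
    obtaining fragmented DNA;
    performing end repair and A-tailing on the fragmented DNA to obtain end-repaired products;
    treating the end-repaired products with the adapter ligation reagent according to claim 13, so that the adapters in the adapter ligation reagent react with the end-repaired products to obtain adapter ligation products;
    enriching the adapter ligation products to obtain a DNA library.
  18. A gene sequencing detection method, comprising:
    performing gene sequencing of DNA using the DNA library obtained by the DNA library construction method according to claim 17.
  19. A kit, comprising:
    the DNA library obtained by the DNA library construction method according to claim 18.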
PCT/CN2021/134159 2021-11-29 2021-11-29 UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method WO2023092601A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
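CN202180003697.6A CN116529430A (zh) 2021-11-29 2021-11-29 UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method
PCT/CN2021/134159 WO2023092601A1 (zh) 2021-11-29 2021-11-29 UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method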
US17/912,373 US20240209349A1 (en) 2021-11-29 2021-11-29 Umi and application thereof, molecular identifier group, adapter, adapter ligation reagent, kits, method for constructing dna library and method for sequencing gene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/134159 WO2023092601A1 (zh) 2021-11-29 2021-11-29 UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method

Publications (1)

Publication Number Publication Date
WO2023092601A1 true WO2023092601A1 (zh) 2023-06-01

Family

ID=86538783

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/134159 WO2023092601A1 (zh) UMI molecular tag and use thereof, adapter, adapter ligation reagent, kit, and library construction method 2021-11-29 2021-11-29

Country Status (3)

Country Link
US (1) US20240209349A1 (zh)
CN (1) CN116529430A (zh)
WO (1) WO2023092601A1 (zh)

Also Published As

Publication number Publication date
CN116529430A (zh) 2023-08-01
US20240209349A1 (en) 2024-06-27

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202180003697.6

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 17912373

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21965330

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE