CN112466405B - Method for preparing molecular tag library for sequencing - Google Patents

Info

Publication number: CN112466405B
Authority: CN (China)
Prior art keywords: sequence, floor, molecular tag, sequences, temp2
Legal status: Active
Application number: CN202011540460.5A
Other languages: Chinese (zh)
Other versions: CN112466405A (en)
Inventors: 罗俊峰, 陈曦, 张稀, 徐雪, 汪进平
Current Assignee: Carrier Gene Technology Suzhou Co ltd
Original Assignee: Carrier Gene Technology Suzhou Co ltd

Application filed by Carrier Gene Technology Suzhou Co ltd
Priority to CN202011540460.5A
Publication of CN112466405A
Application granted
Publication of CN112466405B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00: Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06: Biochemical methods, e.g. using enzymes or whole viable microorganisms

Abstract

The invention discloses a method for preparing a molecular tag library for sequencing. Each molecular tag is built by joining 7-base sequence units (B7 sequences) in series in a defined pattern. Within a B7 sequence, the rightmost 3 bases are check codes computed from the leftmost 4 bases by an encoding formula, so that an error at any single base of the B7 sequence can be corrected back to the correct coding sequence by a decoding and correction formula. The method ensures that the library contains enough molecular tag species, that the tag sequences are known and controllable, and that the tag sequences can be checked and corrected, which helps improve the accuracy of sequencing results and the accurate, specific identification of target molecules in a sample.

Description

Method for preparing molecular tag library for sequencing
Technical Field
The invention belongs to the technical field of biotechnology detection, and particularly relates to a method for preparing a molecular tag library for sequencing.
Background
When detecting DNA fragment molecules, it is sometimes necessary to know not only the sequence of the fragments but also the number of original DNA fragment molecules. Amplification, however, produces a large number of identical fragments: PCR amplifies the hundreds to tens of thousands of original target molecules by a factor of 2 raised to a power of several tens. This obliterates the information on the number of original molecules and introduces amplification errors and sequencing errors that cannot be identified or corrected. To recover the original sequence and molecule count more accurately, scientists attach molecular tags to the original DNA fragment molecules and use the uniqueness of the tags to analyze the sequence and number of the original molecules.
In most cases a molecular tag is a run of several N (N = A/C/T/G) or H (H = A/T/C) positions filled at random during synthesis. A library of 12 H positions, for example, gives 3^12 = 531,441 sequences. Such a library is simple to obtain and offers enough possible tags, but the tags are not under human control. Sequences such as AAA AAA AAA AAA and AAA AAA AAA AAT can both appear, and because amplification and sequencing errors objectively exist downstream, it is impossible to tell whether the difference between these two tags was caused by an error or was present in the library from the start. A randomly formed library also gives no control over GC content, and long runs of identical bases (as in AAA AAA AAA AAA and AAA AAA AAA AAT) can cause problems on some sequencing platforms; on the PGM sequencing platform of Thermo, for example, these two sequences are difficult to distinguish, which leads to loss of information that is really present.
An important precondition for using molecular tags is a sufficient number of tag species, which may be tens or hundreds of thousands. Random synthesis with degenerate bases is a low-cost approach, but the sequences and their proportions are uncontrollable; synthesizing enough tag sequences one by one gives control over sequence and proportion but is very uneconomical.
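The arithmetic in this background can be checked with a short Python sketch. It counts the tags in a random library of 12 H positions and measures how many positions separate the two example tags. The helper name `hamming` is ours and does not come from the patent.

```python
# Size of a random tag library built from 12 degenerate H positions (H = A/T/C).
library_size = 3 ** 12
print(library_size)  # 531441


def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(1 for p, q in zip(s, t) if p != q)


# The two example tags differ at a single base, so one amplification or
# sequencing error is enough to turn one tag into the other.
tag_a = "AAAAAAAAAAAA"
tag_b = "AAAAAAAAAAAT"
print(hamming(tag_a, tag_b))  # 1
```

A distance of 1 is the reason a random library cannot separate a true second tag from a corrupted copy of the first.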
Disclosure of Invention
To solve these technical problems, the invention discloses a method for preparing a molecular tag library composed of non-random sequences. Each molecular tag is built by joining 7-base sequence units (B7 sequences) in series in a defined pattern. The rightmost 3 bases of a B7 sequence are check codes computed from the leftmost 4 bases by an encoding formula, so that an error at any single base of the B7 sequence can be corrected back to the correct coding sequence by the decoding and correction formula.
The first object of the present invention is to provide a method for preparing a molecular tag library for sequencing, comprising the steps of:
S1, designing a molecular tag B7 sequence, wherein the B7 sequence is designed as follows:
the 7-base sequence of the B7 sequence is defined as (a b c d x y z), wherein:
a, b, c and d are information bits and represent the digit sequence converted from a randomly generated 4-base sequence consisting of the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained from a, b, c and d according to the following formulas:
[Formulas for x, y and z: three equation images, not reproduced in this text]
wherein floor is the round-down (floor) function;
S2, combining a plurality of different molecular tag sequences with a specific sequence to obtain specific sequences containing the molecular tag library; the molecular tag sequence consists of n (E2+B7+F2) units, wherein E2 is 0 to 5 bases, F2 is 0 to 5 bases, and n is any integer from 1 to 20.
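As a minimal sketch of the bookkeeping in S1 and S2, the Python below converts bases to digits with the stated mapping (A = 1, T = 2, G = 3, C = 4) and splits a tag made of n (E2+B7+F2) units into its parts. The function names are ours, and the sketch assumes that every unit in a tag uses the same E2 and F2 lengths, which the text does not require.

```python
BASE_TO_DIGIT = {"A": 1, "T": 2, "G": 3, "C": 4}
DIGIT_TO_BASE = {digit: base for base, digit in BASE_TO_DIGIT.items()}


def bases_to_digits(seq):
    """Convert a base sequence (upper or lower case) to a list of digits 1..4."""
    return [BASE_TO_DIGIT[base] for base in seq.upper()]


def digits_to_bases(digits):
    """Convert a list of digits 1..4 back to an upper-case base sequence."""
    return "".join(DIGIT_TO_BASE[digit] for digit in digits)


def split_units(tag, e2_len, f2_len):
    """Split a tag of n (E2+B7+F2) units into (E2, B7, F2) triples.

    Assumes fixed E2 and F2 lengths for all units (an assumption of this sketch).
    """
    unit_len = e2_len + 7 + f2_len
    if len(tag) % unit_len != 0:
        raise ValueError("tag length is not a multiple of the unit length")
    units = []
    for start in range(0, len(tag), unit_len):
        unit = tag[start:start + unit_len]
        units.append((unit[:e2_len], unit[e2_len:e2_len + 7], unit[e2_len + 7:]))
    return units


print(bases_to_digits("TTGA"))  # [2, 2, 3, 1]
print(split_units("ACaagggaaACACataattcAC", 2, 2))
# [('AC', 'aagggaa', 'AC'), ('AC', 'ataattc', 'AC')]
```

The two 11-base units in the usage line are borrowed from the synthesis example later in the description, where E2 and F2 are both the two bases AC.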
Furthermore, the GC content (CG%) of the n (E2+B7+F2) units is 35% to 75%.
Further, the ratio of the number of molecular tag sequences in the molecular tag library for sequencing to the number of target molecules is greater than 10:1. A ratio greater than 10:1 satisfies the Poisson distribution requirement, ensuring that each target molecule has a greater than 95% probability of carrying a unique molecular tag sequence.
Further, combining the molecular tag sequence with the specific sequence specifically comprises the following steps:
S01, dividing the synthesized specific sequence into separate portions, and synthesizing each sequence of the first unit (n = 1) onto one portion of the specific sequence, one sequence per portion;
S02, mixing the sequences synthesized in S01, dividing the mixture into separate portions again, synthesizing each sequence of the second unit (n = 2) onto one portion, and so on, until specific sequences containing the molecular tag library are synthesized in the number of molecular tag sequences required.
Further, the specific sequence is a PCR amplification primer, a hybridization probe, an isothermal extension primer or a ligation primer.
Further, the molecular tag B7 sequence is any one of the following sequences:
[Table of B7 sequences: images not reproduced in this text]
The second object of the invention is to provide an error correction method for a molecular tag library, comprising the following steps:
S001, setting temporary values temp1, temp2 and temp3: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
S002, evaluating whether the information bits a, b, c and d contain errors according to the values of temp1, temp2 and temp3;
S003, if an error has occurred, completing the self-check, replacing the erroneous information bit with the correct one, converting the result into a base sequence, and outputting it.
Further, the specific steps of evaluating in step S002 whether the information bits a, b, c and d contain errors according to the values of temp1, temp2 and temp3 are as follows:
if temp1-4*floor(temp1/4) is not equal to 2 and temp2-4*floor(temp2/4) is not equal to 2, an error has occurred at position b; the correct value of b is then calculated:
b1=14-a-d-x-4*floor((14-a-d-x-1)/4), b=b1
that is, b is replaced by b1;
and b2 is set as b2=14-c-d-y-4*floor((14-c-d-y-1)/4)
If b1 ≠ b2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current B7 sequence exits;
if temp2-4*floor(temp2/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position c; the correct value of c is then calculated:
c1=14-a-d-z-4*floor((14-a-d-z-1)/4), c=c1
that is, c is replaced by c1;
and c2 is set as c2=14-b-d-y-4*floor((14-b-d-y-1)/4)
If c1 ≠ c2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current B7 sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position a; the correct value of a is then calculated:
a1=14-b-d-x-4*floor((14-b-d-x-1)/4), a=a1
that is, a is replaced by a1;
and a2 is set as a2=14-c-d-z-4*floor((14-c-d-z-1)/4)
If a1 ≠ a2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current B7 sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2, temp2-4*floor(temp2/4) is not equal to 2, and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position d; the correct value of d is then calculated:
d1=14-a-b-x-4*floor((14-a-b-x-1)/4), d=d1
that is, d is replaced by d1;
and d2 and d3 are set as d2=14-b-c-y-4*floor((14-b-c-y-1)/4);
d3=14-a-c-z-4*floor((14-a-c-z-1)/4);
if d1, d2 and d3 are not all equal, two or more information bits are in error, decoding cannot be completed, and the correction process for the current B7 sequence exits;
where floor is the round-down (floor) function.
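The rules of S001 to S003 can be sketched in Python as follows. This is an illustration under stated assumptions, not the Matlab implementation referred to in the description. The assumptions are: (1) the cases for b, c and a apply when exactly those two checks fail, while the case for d applies when all three fail; (2) when only one check fails, the error is taken to lie in a check digit and a, b, c, d are kept unchanged; (3) the test digits 2 2 3 1 1 4 4 were chosen by us so that all three sums leave remainder 2 on division by 4, and they are not taken from the tables in this document. The names `wrap` and `correct_b7` are ours.

```python
def wrap(v: int) -> int:
    """Map v to the digit in 1..4 with the same remainder mod 4,
    i.e. v - 4*floor((v - 1)/4) as written in the correction formulas."""
    return v - 4 * ((v - 1) // 4)


def correct_b7(digits):
    """Apply the S001-S003 rules to one B7 unit given as 7 digits (1..4).

    Returns (info_digits, status); info_digits is None when the recomputed
    values disagree and decoding of this unit is abandoned.
    """
    a, b, c, d, x, y, z = digits
    # S001: a check "fails" when temp - 4*floor(temp/4) is not equal to 2.
    fail1 = (a + b + d + x) % 4 != 2
    fail2 = (b + c + d + y) % 4 != 2
    fail3 = (a + c + d + z) % 4 != 2

    if not (fail1 or fail2 or fail3):
        return [a, b, c, d], "no error detected"

    # S002: locate the suspect information digit and recompute it
    # from every check it takes part in.
    if fail1 and fail2 and fail3:
        name, cands = "d", [wrap(14 - a - b - x), wrap(14 - b - c - y), wrap(14 - a - c - z)]
    elif fail1 and fail2:
        name, cands = "b", [wrap(14 - a - d - x), wrap(14 - c - d - y)]
    elif fail2 and fail3:
        name, cands = "c", [wrap(14 - a - d - z), wrap(14 - b - d - y)]
    elif fail1 and fail3:
        name, cands = "a", [wrap(14 - b - d - x), wrap(14 - c - d - z)]
    else:
        # Assumption of this sketch: a single failing check means a check digit is wrong.
        return [a, b, c, d], "check digit error; information digits kept"

    if len(set(cands)) != 1:
        return None, "two or more errors; decoding abandoned"

    # S003: replace the erroneous information digit with the recomputed value.
    info = {"a": a, "b": b, "c": c, "d": d}
    info[name] = cands[0]
    return [info["a"], info["b"], info["c"], info["d"]], f"corrected {name}"


codeword = [2, 2, 3, 1, 1, 4, 4]            # hypothetical valid unit (see lead-in)
print(correct_b7(codeword))                  # ([2, 2, 3, 1], 'no error detected')
print(correct_b7([2, 4, 3, 1, 1, 4, 4]))     # ([2, 2, 3, 1], 'corrected b')
```

Recomputing the suspect digit from two (or, for d, three) independent checks is what lets the procedure notice that more than one digit is wrong: the candidates then disagree and the unit is rejected instead of being silently miscorrected.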
With the above scheme, the invention has at least the following advantages:
the method for constructing the molecular tag library ensures that the library contains enough molecular tag species, that the tag sequences are known and controllable, and that the tag sequences can be checked and corrected, which helps improve the accuracy of sequencing results and the accurate, specific identification of target molecules in a sample.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented according to this specification, preferred embodiments are described in detail below.
Detailed Description
Example 1: Encoding scheme for the B7 sequence
Three check bits are appended to the right end of the 4-base information to form a B7 sequence. The check bits are obtained by the following algorithm, so that when any one position of the B7 sequence is in error, the correct sequence can be restored.
1) First, a 4-base sequence consisting of A, T, G and C is randomly generated, or a particular 4-base sequence is specified, for example TTGA;
2) the 4-base sequence is converted from letters to digits: A or a becomes 1; T or t becomes 2; G or g becomes 3; C or c becomes 4. For example, the 4-base sequence TTGA is converted to the digit sequence 2231;
3) the resulting 4-digit sequence is defined as abcd; for the digit sequence 2231, a = 2, b = 2, c = 3 and d = 1. The 3 check digits are then obtained in turn with the following conversion formulas:
[Formulas for x, y and z: three equation images, not reproduced in this text]
wherein floor is the round-down function in Matlab;
4) the check digits are appended to the end of abcd, giving the digitized B7 sequence a b c d x y z, which is then converted back into a B7 letter sequence;
5) once the B7 letter sequence is obtained, its GC content and degree of internal repetition are examined. A sequence is output only if its GC content is greater than 0.2 and less than 0.8; sequences whose internal repetition is too high are also not output.
The following is an implementation of B7 sequence encoding in Matlab:
[Matlab listing: images not reproduced in this text]
In practice, this yields a set of encoded B7 sequences, for example the 240 sequences in Table 1.
Table 1. 240 self-error-correcting sequences
[Table 1: images not reproduced in this text]
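Step 5 of Example 1 (the output filter) can be sketched in Python as below. The GC bounds of 0.2 and 0.8 come from the text. The text does not say how "degree of internal repetition" is measured, so this sketch uses the longest run of identical bases with a limit of 3; both the measure and the limit `max_run=3` are our assumptions, as are the function names.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    return sum(1 for base in s if base in "GC") / len(s)


def longest_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    if not seq:
        raise ValueError("empty sequence")
    s = seq.upper()
    best = run = 1
    for prev, cur in zip(s, s[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def passes_filter(seq: str, gc_min: float = 0.2, gc_max: float = 0.8, max_run: int = 3) -> bool:
    """Keep a sequence only if gc_min < GC < gc_max and no run exceeds max_run.

    max_run is a hypothetical stand-in for the unspecified repetition criterion.
    """
    return gc_min < gc_fraction(seq) < gc_max and longest_run(seq) <= max_run


print(passes_filter("AAGGGAA"))  # True: GC = 3/7, longest run = 3
print(passes_filter("AAAAAAA"))  # False: GC = 0
print(passes_filter("ACGTTTT"))  # False: run of four T
```

A different repetition measure (for example, counting repeated dinucleotides) would slot into `passes_filter` without changing the GC test.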
Example 2: decoding and error correction scheme for B7 sequences
1) This example concerns decoding of B7 sequences. The input DNA sequence length must be an integer multiple of 7, and each 7-base sequence is defined as (a b c d x y z), where a, b, c and d are information bits and x, y and z are check bits;
2) the base sequence is converted into a digit sequence according to A → 1, T → 2, G → 3, C → 4;
3) the temporary values temp1, temp2 and temp3 are calculated: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
4) whether each of the information bits a, b, c and d is in error is then evaluated from the values of temp1, temp2 and temp3:
if temp1-4*floor(temp1/4) is not equal to 2 and temp2-4*floor(temp2/4) is not equal to 2, an error has occurred at position b; the correct value of b is then calculated:
b1=14-a-d-x-4*floor((14-a-d-x-1)/4), b=b1
that is, b is replaced by b1. A second formula is also used:
b2=14-c-d-y-4*floor((14-c-d-y-1)/4)
if b1 is not equal to b2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current 7-base sequence exits;
if temp2-4*floor(temp2/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position c; the correct value of c is then calculated:
c1=14-a-d-z-4*floor((14-a-d-z-1)/4), c=c1
that is, c is replaced by c1. A second formula is also used:
c2=14-b-d-y-4*floor((14-b-d-y-1)/4)
if c1 is not equal to c2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current 7-base sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position a; the correct value of a is then calculated:
a1=14-b-d-x-4*floor((14-b-d-x-1)/4), a=a1
that is, a is replaced by a1. A second formula is also used:
a2=14-c-d-z-4*floor((14-c-d-z-1)/4)
if a1 is not equal to a2, two or more information bits are in error, decoding cannot be completed, and the correction process for the current 7-base sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2, temp2-4*floor(temp2/4) is not equal to 2, and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position d; the correct value of d is then calculated:
d1=14-a-b-x-4*floor((14-a-b-x-1)/4), d=d1
that is, d is replaced by d1. Two further formulas are also used:
d2=14-b-c-y-4*floor((14-b-c-y-1)/4);
d3=14-a-c-z-4*floor((14-a-c-z-1)/4);
if d1, d2 and d3 are not all equal, two or more information bits are in error, decoding cannot be completed, and the correction process for the current 7-base sequence exits;
wherein floor is the round-down function in Matlab;
5) if an error has occurred, the self-check is completed, the erroneous information bit is replaced with the correct one, and the result is converted into a base sequence for output.
The following is an implementation of the decoding and correction process for the B7 sequence in Matlab:
[Matlab listing: images not reproduced in this text]
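Steps 1 to 4 of Example 2 (length check, base-to-digit conversion, the three temporary sums, and locating the suspect position) can be sketched in Python as below. This is not the Matlab listing. The table `PATTERN_TO_POSITION` encodes the four cases in the text for a, b, c and d; the entries for x, y and z (exactly one failing check) are our assumption. The sequence TTGAACC corresponds to the digits 2 2 3 1 1 4 4, which we chose so that all three checks pass; it is not taken from Table 1.

```python
BASE_TO_DIGIT = {"A": 1, "T": 2, "G": 3, "C": 4}

# (check1 fails, check2 fails, check3 fails) -> suspected position
PATTERN_TO_POSITION = {
    (False, False, False): None,
    (True, True, False): "b",
    (False, True, True): "c",
    (True, False, True): "a",
    (True, True, True): "d",
    (True, False, False): "x",   # assumption: lone failing check -> check digit
    (False, True, False): "y",   # assumption
    (False, False, True): "z",   # assumption
}


def diagnose(dna: str):
    """Return one (unit, (temp1, temp2, temp3), suspect) tuple per 7-base unit."""
    if len(dna) % 7 != 0:
        raise ValueError("input length must be an integer multiple of 7")
    report = []
    for start in range(0, len(dna), 7):
        unit = dna[start:start + 7]
        a, b, c, d, x, y, z = (BASE_TO_DIGIT[base] for base in unit.upper())
        temps = (a + b + d + x, b + c + d + y, a + c + d + z)
        # temp - 4*floor(temp/4) is the remainder of temp on division by 4.
        pattern = tuple(t - 4 * (t // 4) != 2 for t in temps)
        report.append((unit, temps, PATTERN_TO_POSITION[pattern]))
    return report


# Second unit has T -> C at position b.
for row in diagnose("TTGAACC" + "TCGAACC"):
    print(row)
# ('TTGAACC', (6, 10, 10), None)
# ('TCGAACC', (8, 12, 10), 'b')
```

Each of the seven positions takes part in a different subset of the three sums, which is why a single wrong base produces a failure pattern that points to exactly one position.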
example 3: method for preparing molecular tag (E2+ B7+ F2) n by synthesis
The preparation procedure assuming that n is 4 is as follows:
1. preparation of specific primers with 331,776 molecular tag
a) A sufficient amount of the desired specific sequence FP, e.g., 5-GGACCCCCACACAGCAAA-3, is synthesized, and the number of molecules is divided into 24;
b) the sequence of round E2+ B7+ F2, e.g., the following 24 sequences (5 '-3'), was determined. These 24 sequences are synthesized one by one on the basis of each specific sequence, for example the 1st sequence ACaagggaaAC in the table below is synthesized on the basis of the 1st specific sequence FP, and so on. After the synthesis is finished, the number of molecules is equally divided into 24 parts again, and the n-th-2-round synthesis is prepared;
Figure BDA0002854424280000132
Figure BDA0002854424280000141
c) the sequences to be used in the n-2 th round are determined, for example, the following 24 sequences (5 '-3') are synthesized one by one on the basis of each of the n-1 th round mixtures, for example, the 1st sequence ACataattcAC in the following table is synthesized on the basis of the first n-1 th round mixture, after the synthesis is completed, the 24 n-2 th round sequences are obtained, and then the molecules are mixed in equal parts, and the molecules are further divided into 24 parts to prepare the n-3 th round synthesis;
1 ACataattcAC 7 ACcaagtgtAC 13 ACgaatctcAC 19 ACtaatataAC
2 ACaatttaaAC 8 ACcagcctgAC 14 ACgacctcgAC 20 ACtaccgctAC
3 ACactaagtAC 9 ACccgggccAC 15 ACgcctataAC 21 ACtcactcgAC
4 ACacggtcaAC 10 ACcgcattcAC 16 ACggaccttAC 22 ACtccgagaAC
5 ACagcagtaAC 11 ACccattggAC 17 ACgtaggcaAC 23 ACtgtaccaAC
6 ACatctacgAC 12 ACcttgataAC 18 ACgcaatagAC 24 ACttagtccAC
d) the sequences for round 3 are determined, e.g., the following 24 sequences (5'-3'). They are synthesized one by one, each onto one portion of the round-2 mixture, in the same manner as in the previous round. After synthesis, the 24 round-3 products are mixed in equal numbers of molecules and divided again into 24 equal portions in preparation for round 4;
1 ACatataatAC 7 ACcataatgAC 13 ACgaactatAC 19 ACtaacgagAC
2 ACaatcaggAC 8 ACcaggtacAC 14 ACgacgatcAC 20 ACtacgctaAC
3 ACacttgcaAC 9 ACcgaatgaAC 15 ACgcccgagAC 21 ACtcagatcAC
4 ACagaaggcAC 10 ACcgctaatAC 16 ACggagtaaAC 22 ACtcgagtcAC
5 ACagctcagAC 11 ACccaggttAC 17 ACgttacacAC 23 ACtgtcaacAC
6 ACatccgtcAC 12 ACctcaggcAC 18 ACgcatagcAC 24 ACttcaaggAC
e) the sequences for round 4 are determined, e.g., the following 24 sequences (5'-3'). They are synthesized one by one, each onto one portion of the round-3 mixture, in the same manner as in the previous round. After synthesis, the 24 round-4 products are mixed in equal numbers of molecules in preparation for synthesis of the universal sequence;
1 ACatacggaAC 7 ACcattgacAC 13 ACgaagagaAC 19 ACtaagcgcAC
2 ACaatggccAC 8 ACctaagtaAC 14 ACgagaggaAC 20 ACtagatgcAC
3 ACactcctgAC 9 ACcgatacgAC 15 ACgccgcgcAC 21 ACtctaggaAC
4 ACagatcctAC 10 ACcgccggaAC 16 ACggtaaccAC 22 ACtcgtcatAC
5 ACagcctgcAC 11 ACcctacggAC 17 ACgttcacaAC 23 ACtgcactgAC
6 ACatcgcatAC 12 ACctctcctAC 18 ACgcacgctAC 24 ACttctgccAC
f) the universal sequence tgt aaa acg acg gcc agt aca is then synthesized onto the round-4 mixture, giving a mixture of specific primers carrying (E2+B7+F2)4 molecular tags and the universal sequence. The mixture contains 24 × 24 × 24 × 24 = 331,776 molecular tags; the tag sequences are known, the ratio between tags is 1:1, and the tags are self-correcting. The final tagged FP sequence is: 5-tgtaaaacgacggccagtaca (N44) GGACCCCCACACAGCAAA-3;
g) increasing n gives longer molecular tag sequences and more molecular tags; for example, with n = 5 the number of molecular tags is 24 × 24 × 24 × 24 × 24 = 7,962,624. Alternatively, n can be kept at 4 while the number of (E2+B7+F2) sequences per round is increased, for example to 36, giving 36 × 36 × 36 × 36 = 1,679,616.
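The split-and-pool arithmetic in steps a) to g) can be sketched in Python. The sketch computes library sizes and enumerates the constructs for a toy two-round case. It assumes that each round extends the 5' end of the growing oligonucleotide, which we infer from the final layout (universal sequence, then tag, then FP); the function names and the toy selection of four sequences from the tables above are ours.

```python
from itertools import product


def library_size(per_round: int, rounds: int) -> int:
    """Number of distinct tags after `rounds` split-and-pool rounds."""
    return per_round ** rounds


print(library_size(24, 4))  # 331776
print(library_size(24, 5))  # 7962624
print(library_size(36, 4))  # 1679616

UNIVERSAL = "tgtaaaacgacggccagtaca"
FP = "GGACCCCCACACAGCAAA"


def assemble(rounds):
    """Yield every 5'-universal + tag + FP-3' construct.

    rounds[0] holds the units added first (next to FP); later rounds sit
    closer to the 5' end. This orientation is an inference, not stated as
    such in the text.
    """
    for combo in product(*rounds):
        tag = "".join(reversed(combo))
        yield UNIVERSAL + tag + FP


# Toy case: two rounds of two units each (placeholders taken from the tables).
toy_rounds = [["ACataattcAC", "ACaatttaaAC"], ["ACatataatAC", "ACaatcaggAC"]]
constructs = list(assemble(toy_rounds))
print(len(constructs))     # 4
print(len(constructs[0]))  # 61 = 21 + 2 * 11 + 18
```

With four rounds of 11-base units the tag length is 4 × 11 = 44 bases, which matches the N44 placeholder in the final FP sequence.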
2. The specific sequence RP, 5-AAG TTA AAA TTC CCG TCG CTA TCA A-3, and the UNITag sequence, 5-tgt aaa acg acg gcc agt aca-3, are synthesized. The FP sequence carrying the 331,776 molecular tags synthesized above (UMI-FP) and the RP sequence are mixed according to the following system, and PCR amplification is performed;
a) Preparation of the 5× Oligo Mix
Primer      Concentration (μM)    Volume (μL)
UMI-FP      100                   20
RP          100                   20
0.1×TE                            Make up to 1000 μL
Total                             1000 μL
b) Preparation of the PCR system
Reagent                                          Volume (μL)
5× Oligo Mix                                     6 μL
2× Taq Master Mix                                15 μL
Ultrasonically fragmented genomic DNA template   10 ng / 30 ng / 100 ng (3 DNA inputs, each repeated 3 times)
Nuclease-Free Water                              Make up to 30 μL
c) UMI-PCR amplification program
[UMI-PCR cycling program: image not reproduced in this text]
After the PCR was completed, 1 unit of exonuclease I was added to each reaction, which was incubated at 37 ℃ for 30 minutes and then inactivated at 80 ℃ for 30 minutes. A further 2 μL of 10 μM RP and 2 μL of 10 μM UNITag were added for the subsequent PCR amplification program.
d) PCR amplification program
[PCR cycling program: image not reproduced in this text]
e) The 9 tubes of amplification product (three inputs of 10 ng, 30 ng and 100 ng, three replicates each) were made into libraries with a commercial Illumina library construction kit and sequenced, and the diversity of molecular tags was then analyzed. A molecular tag was counted only if it had more than 6 reads. The statistics are shown in the table below. According to the analysis, the molecular tag library prepared in this example rescues or corrects about 10% of the effective data on average, which is a very marked effect.
[Table of molecular tag statistics: image not reproduced in this text]
Example 4: process method for constructing (E2+ B7+ F2) n molecular tag library by using connection method
1. 30 (E2+ B7+ F2)2 sequences with a CG% content of 50% were selected, wherein E2 and F2 are 0 bases as shown in the following table:
ID Seq ID Seq
HMB401 aacggttaagacgg HMB416 agtctagatccgtc
HMB402 aacttggaatggcc HMB417 atacggaatggcga
HMB403 aagggaaacaccca HMB418 atcgcatattcgcg
HMB404 aagttccaccgtgt HMB419 atctacgcaaccac
HMB405 aatcaggacctgtg HMB420 atgatcgcacgttg
HMB406 acagttgacggtca HMB421 atgcgatcactggt
HMB407 acatggtacgtgac HMB422 attgctccaggtac
HMB408 acgaatgactcctg HMB423 caagtgtcagtgca
HMB409 actgtacagaaggc HMB424 caatgtgcatccgt
HMB410 acttgcaagcgact HMB425 cacaaacccaggtt
HMB411 agagaagagctcag HMB426 cagaagtccattgg
HMB412 agatcctaggagag HMB427 catgtcaccgactt
HMB413 agcagtaaggctct HMB428 cattgaccctggaa
HMB414 agggataagtgagc HMB429 ccaacaacctttcc
HMB415 agtagctatagccg HMB430 ccgttaacgagcat
The sequences PO3-aaccaccaccaaca + HMB# + accaacaaaccacc were synthesized, 30 in total, mixed in equal numbers of molecules for later use, and designated UMIseq30.
2. 500 (E2+B7+F2)2 sequences were selected, where E2 and F2 are 0 bases, as shown in the table below:
[Table of the 500 sequences: images not reproduced in this text]
The sequences AGACGTGTGCTCTTCCGATCTATCA + HMB# + aaccaccaccaaca were synthesized, 500 in total, mixed in equal numbers of molecules for later use, and designated UMIseq500.
3. The following sequences were synthesized:
Sequence name                  Sequence (5'-3')
Primer-1st stem complement     ggtggtttgttggtggtggtttgttggt
1st-2nd stem complement        tgttggtggtggtttgttggtggtggtt
The last base T at the 3' end is a dideoxy-modified ddT.
4. The specific primer sequences in the following table were synthesized:
[Table of specific primer sequences: images not reproduced in this text]
The 5' ends of all the sequences in the table are modified with a PO3 phosphate group.
5. Ligation step
Each specific primer, the primer-1st stem complement, the 1st-2nd stem complement, UMIseq30 and UMIseq500 were adjusted to a concentration of 2 μM based on the total molecular weight and mixed at a volume ratio of 1:2:2:2:2. Ligation was carried out with a commercial ligase kit following the manufacturer's recommended procedure, giving 37 primer sequences carrying molecular tags, each primer being assigned 30 × 500 = 15,000 tag species. It should be noted that, by the Poisson distribution principle, 15,000 molecular tags are not sufficient to label 10 ng (about 3,000 copies) of molecules so that each of the 3,000 molecules carries a unique tag. For a low proportion of mutant molecules, however, say 1%, there are only 30 molecules, and 15,000 sequences are sufficient to give each of those 30 mutant molecules a unique tag. This example is therefore better suited to tumor detection or to detecting target molecules present at low proportions. Expanding the number of tag species in this example is also simple: the number of 1st stem sequences and 2nd stem sequences can be increased, for example to 40 and 1000 respectively, giving 40 × 1000 = 40,000 molecular tag species.
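The contrast drawn above between 3,000 and 30 molecules can be made concrete with a small Python estimate. The model is our assumption, not the patent's: every molecule independently receives one of the available tags with equal probability, and we ask how likely it is that a given molecule shares its tag with no other molecule. The function name `p_unique` is ours.

```python
def p_unique(num_tags: int, num_molecules: int) -> float:
    """Probability that one given molecule's tag is used by no other molecule,
    assuming uniform and independent tag assignment (a simplifying model)."""
    return (1 - 1 / num_tags) ** (num_molecules - 1)


tags = 30 * 500
print(tags)                   # 15000
print(p_unique(tags, 3000))   # about 0.82
print(p_unique(tags, 30))     # about 0.998

# Expanded library mentioned in the text: 40 x 1000 tag species.
print(40 * 1000)              # 40000
```

Under this model roughly one molecule in five would share a tag when 3,000 molecules draw from 15,000 tags, while collisions among 30 molecules are rare.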
The above is only a preferred embodiment of the invention and is not intended to limit it. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the invention, and these modifications and variations should also be regarded as falling within the scope of protection of the invention.

Claims (8)

1. A method for preparing a molecular tag library for sequencing, characterized by comprising the following steps:
S1, designing a molecular tag B7 sequence, wherein the B7 sequence is designed as follows:
the 7-base sequence of the B7 sequence is defined as (a b c d x y z), wherein:
a, b, c and d are information bits and represent the digit sequence converted from a 4-base sequence randomly formed from the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained from a, b, c and d according to the following formulas:
[Formulas for x, y and z: three equation images, not reproduced in this text]
wherein floor is the round-down (floor) function;
S2, combining a plurality of different molecular tag sequences with a specific sequence to obtain specific sequences containing the molecular tag library; the molecular tag sequence consists of n (E2+B7+F2) units, wherein E2 is 0 to 5 bases, F2 is 0 to 5 bases, and n is any integer from 1 to 20.
2. The method of claim 1, wherein the GC content (CG%) of the n (E2+B7+F2) units is 35% to 75%.
3. The method of claim 1, wherein the ratio of the number of molecular tag sequences in the molecular tag library for sequencing to the number of target molecules is greater than 10:1.
4. The method of claim 1, wherein combining the molecular tag sequence with the specific sequence comprises the following steps:
S01, dividing the synthesized specific sequence into separate portions, and synthesizing each sequence of the first unit (n = 1) onto one portion of the specific sequence, one sequence per portion;
S02, mixing the sequences synthesized in S01, dividing the mixture into separate portions again, synthesizing each sequence of the second unit (n = 2) onto one portion, and so on, until specific sequences containing the molecular tag library are synthesized in the number of molecular tag sequences required.
5. The method of claim 1, wherein the specific sequence is a PCR amplification primer, a hybridization probe, an isothermal extension primer, or a ligation primer.
6. The method according to claim 1, wherein the molecular tag B7 sequence is any one of the following sequences:
[Table of B7 sequences: images not reproduced in this text]
7. An error correction method for a molecular tag library, characterized by comprising the following steps:
S001, setting temporary values temp1, temp2 and temp3: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
wherein a, b, c, d, x, y and z represent the 7 bases of the molecular tag B7 sequence,
a, b, c and d are information bits and represent the digit sequence converted from a 4-base sequence randomly formed from the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained from a, b, c and d according to the following formulas:
[Formulas for x, y and z: three equation images, not reproduced in this text]
wherein floor is the round-down (floor) function;
S002, evaluating whether the information bits a, b, c and d contain errors according to the values of temp1, temp2 and temp3;
S003, if an error has occurred, completing the self-check, replacing the erroneous information bit with the correct one, converting the result into a base sequence, and outputting it.
8. The error correction method for a molecular tag library according to claim 7, wherein in step S002, evaluating whether the information bits a, b, c and d have errors according to the values of temp1, temp2 and temp3 comprises:
if temp1 - 4*floor(temp1/4) is not equal to 2 and temp2 - 4*floor(temp2/4) is not equal to 2, an error has occurred at position b; the correct value of b is then calculated:
b1 = 14 - a - d - x - 4*floor((14 - a - d - x - 1)/4), b = b1,
that is, b is replaced by b1;
and b2 is set as b2 = 14 - c - d - y - 4*floor((14 - c - d - y - 1)/4);
if b1 ≠ b2, two or more information bits are in error, the decoded output cannot be completed, and the correction process for the current B7 sequence exits;
if temp2 - 4*floor(temp2/4) is not equal to 2 and temp3 - 4*floor(temp3/4) is not equal to 2, an error has occurred at position c; the correct value of c is then calculated:
c1 = 14 - a - d - z - 4*floor((14 - a - d - z - 1)/4), c = c1,
that is, c is replaced by c1;
and c2 is set as c2 = 14 - b - d - y - 4*floor((14 - b - d - y - 1)/4);
if c1 ≠ c2, two or more information bits are in error, the decoded output cannot be completed, and the correction process for the current B7 sequence exits;
if temp1 - 4*floor(temp1/4) is not equal to 2 and temp3 - 4*floor(temp3/4) is not equal to 2, an error has occurred at position a; the correct value of a is then calculated:
a1 = 14 - b - d - x - 4*floor((14 - b - d - x - 1)/4), a = a1,
that is, a is replaced by a1;
and a2 is set as a2 = 14 - c - d - z - 4*floor((14 - c - d - z - 1)/4);
if a1 ≠ a2, two or more information bits are in error, the decoded output cannot be completed, and the correction process for the current B7 sequence exits;
if temp1 - 4*floor(temp1/4) is not equal to 2, temp2 - 4*floor(temp2/4) is not equal to 2, and temp3 - 4*floor(temp3/4) is not equal to 2, an error has occurred at position d; the correct value of d is then calculated:
d1 = 14 - a - b - x - 4*floor((14 - a - b - x - 1)/4), d = d1,
that is, d is replaced by d1;
and d2 and d3 are set as d2 = 14 - b - c - y - 4*floor((14 - b - c - y - 1)/4);
d3 = 14 - a - c - z - 4*floor((14 - a - c - z - 1)/4);
if d1 ≠ d2 ≠ d3, two or more information bits are in error, the decoded output cannot be completed, and the correction process for the current B7 sequence exits;
wherein floor is the round-down (floor) function.
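The decision procedure of claim 8 can be sketched in code as follows. It is a minimal illustration, not the patented implementation, and it rests on several assumptions that the claim text does not settle: the case where all three checks fail is tested first, because the two-check conditions do not state that the third check passes; the condition on d1, d2 and d3 is read as "not all equal"; and when exactly one check fails the information bits are returned unchanged, on the reasoning that only a check bit can then be wrong. The names `solve` and `correct_b7` are hypothetical.

```python
BASE_TO_DIGIT = {"A": 1, "T": 2, "G": 3, "C": 4}
DIGIT_TO_BASE = {digit: base for base, digit in BASE_TO_DIGIT.items()}


def solve(p, q, r):
    """Value in 1..4 computed as 14-p-q-r-4*floor((14-p-q-r-1)/4)."""
    t = 14 - p - q - r
    return t - 4 * ((t - 1) // 4)


def correct_b7(tag):
    """Return the corrected 4-base information sequence, or None if
    two or more information bits appear to be wrong."""
    a, b, c, d, x, y, z = (BASE_TO_DIGIT[base] for base in tag)
    # For positive integers, t % 4 equals t - 4*floor(t/4).
    bad1 = (a + b + d + x) % 4 != 2
    bad2 = (b + c + d + y) % 4 != 2
    bad3 = (a + c + d + z) % 4 != 2

    if bad1 and bad2 and bad3:      # d appears in all three sums
        candidates = {solve(a, b, x), solve(b, c, y), solve(a, c, z)}
        if len(candidates) != 1:    # d1, d2, d3 disagree
            return None
        d = candidates.pop()
    elif bad1 and bad2:             # b appears in temp1 and temp2
        b1, b2 = solve(a, d, x), solve(c, d, y)
        if b1 != b2:
            return None
        b = b1
    elif bad2 and bad3:             # c appears in temp2 and temp3
        c1, c2 = solve(a, d, z), solve(b, d, y)
        if c1 != c2:
            return None
        c = c1
    elif bad1 and bad3:             # a appears in temp1 and temp3
        a1, a2 = solve(b, d, x), solve(c, d, z)
        if a1 != a2:
            return None
        a = a1
    # Zero or one failing check: information bits kept as read (assumption).
    return "".join(DIGIT_TO_BASE[v] for v in (a, b, c, d))


if __name__ == "__main__":
    for observed in ("AAAAGGG", "ATAAGGG", "AAATGGG", "TTAAGGG"):
        print(observed, correct_b7(observed))
```

The example tags assume that AAAAGGG is a valid B7 sequence, which holds only if the check bits are defined so that each temp value leaves remainder 2 on division by 4.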
CN202011540460.5A 2020-12-23 2020-12-23 Method for preparing molecular tag library for sequencing Active CN112466405B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011540460.5A CN112466405B (en) 2020-12-23 2020-12-23 Method for preparing molecular tag library for sequencing

Publications (2)

Publication Number Publication Date
CN112466405A CN112466405A (en) 2021-03-09
CN112466405B true CN112466405B (en) 2021-06-22

Family

ID=74803373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011540460.5A Active CN112466405B (en) 2020-12-23 2020-12-23 Method for preparing molecular tag library for sequencing

Country Status (1)

Country Link
CN (1) CN112466405B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114774516B (en) * 2022-03-28 2024-04-12 深圳裕康医学检验实验室 UMI sequence design method for correcting sequencing errors and application thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004083819A2 (en) * 2003-03-17 2004-09-30 Trace Genetics, Inc Molecular forensic specimen marker
CN105119717A (en) * 2015-07-21 2015-12-02 郑州轻工业学院 DNA coding based encryption system and encryption method
CN106086162A (en) * 2015-11-09 2016-11-09 厦门艾德生物医药科技股份有限公司 A kind of double label joint sequences for detecting Tumor mutations and detection method
CN107365861A (en) * 2017-08-28 2017-11-21 华中农业大学 A kind of molecular labeling for differentiating dark-brown cotton
CN108932401A (en) * 2018-06-07 2018-12-04 江西海普洛斯生物科技有限公司 It is a kind of be sequenced sample identification method and its application
CN109337966A (en) * 2017-08-01 2019-02-15 上海禀远生物科技有限公司 A kind of molecular label and its reagent and application

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001062935A1 (en) * 2000-02-25 2001-08-30 Villoo Morawala Patell A process for constructing dna based molecular marker for enabling selection of drought and diseases resistant germplasm screening
US7155453B2 (en) * 2002-05-22 2006-12-26 Agilent Technologies, Inc. Biotechnology information naming system
EP2619327B1 (en) * 2010-09-21 2014-10-22 Population Genetics Technologies LTD. Increasing confidence of allele calls with molecular counting
CN109326322B (en) * 2018-08-17 2020-12-08 华中科技大学 Method and system for comparing QTL (quantitative trait loci) among different segregation groups of crops

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
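Advances in molecular marker techniques and their applications in plant sciences; Milee Agarwal et al.; Plant Cell Rep; 20080202; pp. 617-631 *
ATGC transcriptomics: a web-based application to integrate, explore and analyze de novo transcriptomic data; Sergio Gonzalez et al.; BMC Bioinformatics; 20170222; pp. 1-9 *
A genetic sequence analysis method based on a convolutional code model; Liu Xiao et al.; Journal of Northwest A&F University (Natural Science Edition); 20100410; Vol. 38, No. 4; pp. 207-214 *
Gene sequencing "black technology" gives life a complete "digital interpretation"; Zhang Jiaxing; Science and Technology Daily; 20190625; pp. 1-2 *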

Also Published As

Publication number Publication date
CN112466405A (en) 2021-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant