CN112466405B - Method for preparing molecular tag library for sequencing - Google Patents
- Publication number
- CN112466405B (application CN202011540460.5A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- floor
- molecular tag
- sequences
- temp2
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- C—CHEMISTRY; METALLURGY
- C40—COMBINATORIAL TECHNOLOGY
- C40B—COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
- C40B50/00—Methods of creating libraries, e.g. combinatorial synthesis
- C40B50/06—Biochemical methods, e.g. using enzymes or whole viable microorganisms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Abstract
The invention discloses a method for preparing a molecular tag library for sequencing. Each molecular tag is built by connecting 7-base sequence units (B7 sequences) in series in a defined pattern. In each B7 sequence, the rightmost 3 bases are check codes computed from the leftmost 4 bases by an encoding formula, so that an error at any single base of the B7 sequence can be corrected back to the correct coded sequence by a decoding and correction formula. The method ensures that the library contains enough molecular tag species, that the tag sequences are known and controllable, and that the tag sequences can be error-checked and corrected, which helps improve the accuracy of sequencing results and the accurate and specific identification of target molecules in a sample.
Description
Technical Field
The invention belongs to the technical field of biotechnology detection, and particularly relates to a method for preparing a molecular tag library for sequencing.
Background
When detecting DNA fragment molecules, it is sometimes necessary to know not only the sequence of the DNA fragments but also the number of original DNA fragment molecules. Amplification, however, produces a large number of identical fragments: PCR multiplies the hundreds to tens of thousands of original target molecules by 2 raised to a power of several tens. This erases the information on the number of original DNA fragment molecules and at the same time introduces amplification errors, sequencing errors and the like that cannot be identified or corrected. To recover the original sequence and molecule count of the DNA more accurately, scientists attach molecular tags to the original DNA fragment molecules and use the uniqueness of the tags to analyze the sequence and number of the original molecules.
In most cases the molecular tag is formed at random during synthesis from several N (N is A/C/T/G) or H (H is A/T/C) positions. For example, a library built from 12 H positions contains 3^12 = 531,441 sequences. Such a library is simple to obtain and offers enough possible tags, but the tags are not under human control. Sequences such as AAA AAA AAA AAA and AAA AAA AAA AAT can both occur, and because amplification and sequencing errors objectively exist downstream, these 2 tag sequences cannot be told apart: it is impossible to say whether the difference between them was caused by an error or was present in the molecular tag library from the start. In addition, a randomly formed molecular tag library gives no control over GC content, and long runs of identical bases (such as AAA AAA AAA AAA and AAA AAA AAA AAT) cause trouble on some sequencing platforms. On the PGM sequencing platform of Thermo, for example, these two sequences are difficult to distinguish, which in turn leads to loss of information.
An important prerequisite for using molecular tags is a sufficient number of tag species, which may run to tens or hundreds of thousands. Random synthesis with degenerate bases is a low-cost approach, but the sequences and their proportions are uncontrollable; synthesizing enough molecular tag sequences one by one gives controllable sequences and proportions but is very uneconomical.
Disclosure of Invention
To solve the above technical problems, the invention discloses a method for preparing a molecular tag library consisting of non-random sequences. Each molecular tag is built by connecting 7-base sequence units (B7 sequences) in series in a defined pattern. The rightmost 3 bases of the B7 sequence are check codes computed from the leftmost 4 bases by an encoding formula, so that if any single base of the B7 sequence is in error, it can be corrected back to the correct coded sequence by the decoding and correction formula.
The first object of the present invention is to provide a method for preparing a molecular tag library for sequencing, comprising the steps of:
S1, designing a molecular tag B7 sequence, wherein the B7 sequence is designed according to the following method:
defining the 7-base sequence of the B7 sequence as (a b c d x y z); wherein,
a, b, c and d are information bits and represent the digital sequence converted from a randomly generated 4-base sequence consisting of the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained by converting a, b, c and d according to the following formula:
wherein floor is the round-down (floor) function;
S2, combining a plurality of different molecular tag sequences with the specific sequence to obtain specific sequences containing a molecular tag library; the molecular tag sequence consists of n (E2+B7+F2) units, wherein E2 is 0-5 bases; F2 is 0-5 bases; and n is any integer from 1 to 20.
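The check-digit formula referred to in step S1 does not appear in the text above. A reconstruction, inferred from the decoding formulas given later in the description (for example d1=14-a-b-x-4*floor((14-a-b-x-1)/4)) and therefore an assumption rather than the patent's own wording, would be:

```latex
\begin{aligned}
x &= 14 - a - b - d - 4\left\lfloor \frac{14 - a - b - d - 1}{4} \right\rfloor \\
y &= 14 - b - c - d - 4\left\lfloor \frac{14 - b - c - d - 1}{4} \right\rfloor \\
z &= 14 - a - c - d - 4\left\lfloor \frac{14 - a - c - d - 1}{4} \right\rfloor
\end{aligned}
```

Under this reading each check digit lies in the range 1 to 4, and the sums a+b+d+x, b+c+d+y and a+c+d+z each leave a remainder of 2 on division by 4, which is the condition tested during decoding. For the information digits 2231 (TTGA) it gives x=1, y=4, z=4.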
Furthermore, the CG% of the n (E2+B7+F2) units is 35%-75%.
Further, the ratio of the number of molecular tag sequences in the molecular tag library for sequencing to the number of target molecules is greater than 10:1. A ratio greater than 10:1 satisfies the Poisson distribution requirement, ensuring that each target molecule has a greater than 95% probability of carrying a unique molecular tag sequence.
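The 10:1 figure can be checked with a short calculation. The sketch below is one possible reading of the Poisson argument, not the patent's own derivation: it assumes that the number of target molecules landing on any one tag is Poisson-distributed with mean 1/ratio, and computes the fraction of used tags that label exactly one molecule. The function name `unique_tag_fraction` is hypothetical.

```python
import math

def unique_tag_fraction(tags_per_target: float) -> float:
    """Fraction of used tags that label exactly one target molecule.

    Assumes the count of molecules per tag is Poisson with mean
    lam = 1 / tags_per_target (an illustrative model).
    """
    lam = 1.0 / tags_per_target
    p_exactly_one = lam * math.exp(-lam)    # P(k = 1)
    p_at_least_one = 1.0 - math.exp(-lam)   # P(k >= 1)
    return p_exactly_one / p_at_least_one

# Compare a few tag-to-target ratios.
for ratio in (1, 5, 10, 20):
    print(ratio, round(unique_tag_fraction(ratio), 4))
```

Under this model the fraction passes 95% at a ratio of about 10:1 and keeps rising as the ratio grows.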
Further, the step of combining the molecular tag sequence with the specific sequence specifically comprises the following steps:
S01, dividing the synthesized specific sequence into separate portions, and synthesizing the sequences of the n=1 unit one by one, one on each portion of the specific sequence;
S02, mixing the sequences synthesized in S01, dividing the mixture into separate portions again, synthesizing the sequences of the n=2 unit one by one on the portions, and repeating in the same way until specific sequences containing the molecular tag library are obtained in the number of molecular tag sequences required.
Further, the specific sequence is a PCR amplification primer, a hybridization probe, an isothermal extension primer or a ligation primer.
Further, the molecular tag B7 sequence is any one of the following sequences:
The second object of the invention is to provide an error correction method for a molecular tag library, comprising the following steps:
S001, setting temporary values temp1, temp2 and temp3: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
S002, evaluating whether any of the information bits a, b, c and d is in error according to the values of temp1, temp2 and temp3;
S003, if an error has occurred, completing the self-check, replacing the erroneous information bit with the correct one, converting the result into a base sequence, and outputting it.
Further, the specific steps in step S002 of evaluating whether the information bits a, b, c and d are in error according to the values of temp1, temp2 and temp3 are as follows:
if temp1-4*floor(temp1/4) is not equal to 2 and temp2-4*floor(temp2/4) is not equal to 2, an error has occurred at position b; the correct value of b is then calculated:
b1=14-a-d-x-4*floor((14-a-d-x-1)/4), b=b1
b is replaced by b1;
and b2 is set as b2=14-c-d-y-4*floor((14-c-d-y-1)/4);
if b1 ≠ b2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current B7 sequence exits;
if temp2-4*floor(temp2/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position c; the correct value of c is then calculated:
c1=14-a-d-z-4*floor((14-a-d-z-1)/4), c=c1
c is replaced by c1;
and c2 is set as c2=14-b-d-y-4*floor((14-b-d-y-1)/4);
if c1 ≠ c2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current B7 sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position a; the correct value of a is then calculated:
a1=14-b-d-x-4*floor((14-b-d-x-1)/4), a=a1
a is replaced by a1;
and a2 is set as a2=14-c-d-z-4*floor((14-c-d-z-1)/4);
if a1 ≠ a2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current B7 sequence exits;
if temp1-4*floor(temp1/4) is not equal to 2, temp2-4*floor(temp2/4) is not equal to 2, and temp3-4*floor(temp3/4) is not equal to 2, an error has occurred at position d; the correct value of d is then calculated:
d1=14-a-b-x-4*floor((14-a-b-x-1)/4), d=d1
d is replaced by d1;
and d2 and d3 are set as d2=14-b-c-y-4*floor((14-b-c-y-1)/4);
d3=14-a-c-z-4*floor((14-a-c-z-1)/4);
if d1 ≠ d2 ≠ d3, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current B7 sequence exits;
where floor is the round-down (floor) function.
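The rules above amount to reading off which of the three sums fail the "remainder 2" test. A minimal Python sketch of that lookup follows. The patterns for a, b, c and d are taken from the text; the patterns for the check digits x, y and z are an assumption (each check digit appears in exactly one sum), and the names `residues`, `SUSPECT` and `locate_error` are hypothetical.

```python
def residues(word):
    """Return temp1, temp2, temp3 modulo 4 for digits (a, b, c, d, x, y, z)."""
    a, b, c, d, x, y, z = word
    return ((a + b + d + x) % 4, (b + c + d + y) % 4, (a + c + d + z) % 4)

# Pattern of failed checks (temp1, temp2, temp3) -> suspected position.
# Entries for a, b, c, d follow the text; x, y, z are an assumption.
SUSPECT = {
    (False, False, False): None,
    (True, True, False): "b",
    (False, True, True): "c",
    (True, False, True): "a",
    (True, True, True): "d",
    (True, False, False): "x",
    (False, True, False): "y",
    (False, False, True): "z",
}

def locate_error(word):
    """Name the position implicated by the failed checks, or None if all pass."""
    failed = tuple(r != 2 for r in residues(word))
    return SUSPECT[failed]

print(locate_error((2, 2, 3, 1, 1, 4, 4)))  # all three sums leave remainder 2
print(locate_error((2, 3, 3, 1, 1, 4, 4)))  # second digit altered
```

For non-negative integers, Python's `%` operator gives the same value as temp-4*floor(temp/4) in the text.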
With the above scheme, the invention has at least the following advantages:
the method for constructing the molecular tag library ensures that the library contains enough molecular tag species, that the tag sequences are known and controllable, and that the tag sequences can be error-checked and corrected, which helps improve the accuracy of sequencing results and the accurate and specific identification of target molecules in a sample.
The foregoing is a summary of the technical solution of the invention. So that the technical means of the invention can be clearly understood and implemented in accordance with this specification, preferred embodiments of the invention are described in detail below.
Detailed Description
Example 1: scheme for coding B7 sequence
Three check bits are appended to the right end of the 4-base information to form a B7 sequence. The 3 check bits are obtained by the following algorithm, so that when one position of the B7 sequence is in error, it can be corrected back to the correct sequence.
1) First, a 4-base sequence consisting of A, T, G and C is generated at random, or a given 4-base sequence is taken, for example TTGA;
2) the 4-base sequence is converted from letters to digits: if the base is A or a, it becomes 1; if T or t, 2; if G or g, 3; if C or c, 4. For example, the 4-base sequence TTGA becomes the digit sequence 2231;
3) the 4-digit sequence is defined as abcd; for the digit sequence 2231, a is 2, b is 2, c is 3 and d is 1. The 3 check digits are then obtained in turn with the following conversion formula;
wherein floor is the round-down function in Matlab;
4) the check digits are appended to the end of abcd, giving the digitized B7 sequence a b c d x y z, which is then converted into the B7 letter sequence;
5) after the B7 letter sequence is obtained, its GC content and degree of internal repetition are examined. A sequence is output only if its GC content is greater than 0.2 and less than 0.8; sequences with too much internal repetition are also not output.
The following is an implementation of B7 sequence encoding in Matlab:
In practice, this yields a set of encoded B7 sequences, for example the 240 sequences in Table 1.
Table 1: 240 self-error-correcting sequences
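The encoding steps of Example 1 can also be sketched in Python (this is not the Matlab implementation referred to above). The check-digit formula is an assumption inferred from the decoding formulas of Example 2, the repeat threshold `max_run` is a hypothetical value because the text gives none, and all function names are illustrative.

```python
from itertools import product

# Digit mapping as stated in the text; kept in one place so it can be swapped.
STATED_MAPPING = {"A": 1, "T": 2, "G": 3, "C": 4}

def to_1_4(v):
    """Reduce an integer to the range 1..4: v - 4*floor((v-1)/4)."""
    return v - 4 * ((v - 1) // 4)

def encode_b7(info, mapping=STATED_MAPPING):
    """Append three check bases to a 4-base string (assumed formula)."""
    info = info.upper()
    to_base = {n: ch for ch, n in mapping.items()}
    a, b, c, d = (mapping[ch] for ch in info)
    x = to_1_4(14 - a - b - d)
    y = to_1_4(14 - b - c - d)
    z = to_1_4(14 - a - c - d)
    return info + "".join(to_base[n] for n in (x, y, z))

def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    return sum(ch in "GC" for ch in seq) / len(seq)

def longest_run(seq):
    """Length of the longest stretch of identical consecutive bases."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def build_library(max_run=3, mapping=STATED_MAPPING):
    """Encode all 256 four-base words and keep those passing both filters."""
    library = []
    for bases in product("ATGC", repeat=4):
        seq = encode_b7("".join(bases), mapping)
        if 0.2 < gc_fraction(seq) < 0.8 and longest_run(seq) <= max_run:
            library.append(seq)
    return library

print(encode_b7("TTGA"))
print(len(build_library()))
```

Spot checks suggest that the tabulated sequences of Examples 3 and 4 (for example aagggaa) satisfy the three check sums only when C is read as 3 and G as 4, so the mapping is passed as a parameter rather than fixed in the code.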
Example 2: decoding and error correction scheme for B7 sequences
1) This example decodes B7 sequences. The input DNA sequence length must be an integer multiple of 7, and each 7-base sequence is defined as (a b c d x y z), where a, b, c and d are information bits and x, y and z are check bits;
2) the base sequence is converted into a digit sequence according to A → 1, T → 2, G → 3, C → 4;
3) the temporary values temp1, temp2 and temp3 are calculated: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
4) whether each of the information bits a, b, c and d is in error is then evaluated from the values of temp1, temp2 and temp3:
if temp1-4*floor(temp1/4) is not equal to 2 and temp2-4*floor(temp2/4) is not equal to 2, then an error has occurred at position b; the correct value of b is then calculated:
b1=14-a-d-x-4*floor((14-a-d-x-1)/4), b=b1
b is replaced by b1. A second formula is also used:
b2=14-c-d-y-4*floor((14-c-d-y-1)/4)
if b1 is not equal to b2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current 7-base sequence is exited;
if temp2-4*floor(temp2/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, then an error has occurred at position c; the correct value of c is then calculated:
c1=14-a-d-z-4*floor((14-a-d-z-1)/4), c=c1
c is replaced by c1. A second formula is also used:
c2=14-b-d-y-4*floor((14-b-d-y-1)/4)
if c1 is not equal to c2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current 7-base sequence is exited;
if temp1-4*floor(temp1/4) is not equal to 2 and temp3-4*floor(temp3/4) is not equal to 2, then an error has occurred at position a; the correct value of a is then calculated:
a1=14-b-d-x-4*floor((14-b-d-x-1)/4), a=a1
a is replaced by a1. A second formula is also used:
a2=14-c-d-z-4*floor((14-c-d-z-1)/4)
if a1 is not equal to a2, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current 7-base sequence is exited;
if temp1-4*floor(temp1/4) is not equal to 2, temp2-4*floor(temp2/4) is not equal to 2, and temp3-4*floor(temp3/4) is not equal to 2, then an error has occurred at position d; the correct value of d is then calculated:
d1=14-a-b-x-4*floor((14-a-b-x-1)/4), d=d1
d is replaced by d1. Two further formulas are also used:
d2=14-b-c-y-4*floor((14-b-c-y-1)/4);
d3=14-a-c-z-4*floor((14-a-c-z-1)/4);
if d1 is not equal to d2 is not equal to d3, two or more information bit errors have occurred, the decoding output cannot be completed, and the correction process for the current 7-base sequence is exited;
wherein floor is the round-down function in Matlab;
5) if an error has occurred, the self-check is completed, the erroneous information bit is replaced with the correct one, and the result is converted into a base sequence for output.
The following is an implementation of the decoding and correction process of the B7 sequence in Matlab:
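The decoding and correction steps of this example can also be sketched in Python (this is not the Matlab implementation referred to above). Two points are assumptions: "d1 is not equal to d2 is not equal to d3" is read as "the three values are not all equal", and when only one of the three sums fails the error is taken to lie in a check base, so the information bases are returned unchanged. The names `correct_b7` and `correct_tag` are hypothetical.

```python
# Digit mapping as stated in the text; pass another dict to override it.
STATED_MAPPING = {"A": 1, "T": 2, "G": 3, "C": 4}

def to_1_4(v):
    """Reduce an integer to the range 1..4: v - 4*floor((v-1)/4)."""
    return v - 4 * ((v - 1) // 4)

def correct_b7(seq, mapping=STATED_MAPPING):
    """Return the 4 corrected information bases of one B7 unit,
    or None when the cross-check signals more than one error."""
    to_base = {n: ch for ch, n in mapping.items()}
    a, b, c, d, x, y, z = (mapping[ch] for ch in seq.upper())
    bad1 = (a + b + d + x) % 4 != 2   # temp1 test
    bad2 = (b + c + d + y) % 4 != 2   # temp2 test
    bad3 = (a + c + d + z) % 4 != 2   # temp3 test
    if bad1 and bad2 and bad3:        # error at d
        d1 = to_1_4(14 - a - b - x)
        d2 = to_1_4(14 - b - c - y)
        d3 = to_1_4(14 - a - c - z)
        if not (d1 == d2 == d3):
            return None
        d = d1
    elif bad1 and bad2:               # error at b
        b1, b2 = to_1_4(14 - a - d - x), to_1_4(14 - c - d - y)
        if b1 != b2:
            return None
        b = b1
    elif bad2 and bad3:               # error at c
        c1, c2 = to_1_4(14 - a - d - z), to_1_4(14 - b - d - y)
        if c1 != c2:
            return None
        c = c1
    elif bad1 and bad3:               # error at a
        a1, a2 = to_1_4(14 - b - d - x), to_1_4(14 - c - d - z)
        if a1 != a2:
            return None
        a = a1
    # Zero or one failed check: the information bases are kept as read.
    return "".join(to_base[n] for n in (a, b, c, d))

def correct_tag(seq, mapping=STATED_MAPPING):
    """Split a tag into 7-base units and correct each unit separately."""
    if len(seq) % 7:
        raise ValueError("sequence length must be a multiple of 7")
    return [correct_b7(seq[i:i + 7], mapping) for i in range(0, len(seq), 7)]

print(correct_b7("TTGAACC"))   # valid unit under the stated mapping
print(correct_b7("TAGAACC"))   # second base altered
```

Because every single-base substitution changes a different combination of the three sums, one error per unit can be located and undone; some double errors are caught by the cross-check, but not all of them.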
example 3: method for preparing molecular tag (E2+ B7+ F2) n by synthesis
The preparation procedure assuming that n is 4 is as follows:
1. preparation of specific primers with 331,776 molecular tag
a) A sufficient amount of the desired specific sequence FP, e.g., 5-GGACCCCCACACAGCAAA-3, is synthesized, and the number of molecules is divided into 24;
b) the sequence of round E2+ B7+ F2, e.g., the following 24 sequences (5 '-3'), was determined. These 24 sequences are synthesized one by one on the basis of each specific sequence, for example the 1st sequence ACaagggaaAC in the table below is synthesized on the basis of the 1st specific sequence FP, and so on. After the synthesis is finished, the number of molecules is equally divided into 24 parts again, and the n-th-2-round synthesis is prepared;
c) the sequences to be used in the n-2 th round are determined, for example, the following 24 sequences (5 '-3') are synthesized one by one on the basis of each of the n-1 th round mixtures, for example, the 1st sequence ACataattcAC in the following table is synthesized on the basis of the first n-1 th round mixture, after the synthesis is completed, the 24 n-2 th round sequences are obtained, and then the molecules are mixed in equal parts, and the molecules are further divided into 24 parts to prepare the n-3 th round synthesis;
1 | ACataattcAC | 7 | ACcaagtgtAC | 13 | ACgaatctcAC | 19 | ACtaatataAC |
2 | ACaatttaaAC | 8 | ACcagcctgAC | 14 | ACgacctcgAC | 20 | ACtaccgctAC |
3 | ACactaagtAC | 9 | ACccgggccAC | 15 | ACgcctataAC | 21 | ACtcactcgAC |
4 | ACacggtcaAC | 10 | ACcgcattcAC | 16 | ACggaccttAC | 22 | ACtccgagaAC |
5 | ACagcagtaAC | 11 | ACccattggAC | 17 | ACgtaggcaAC | 23 | ACtgtaccaAC |
6 | ACatctacgAC | 12 | ACcttgataAC | 18 | ACgcaatagAC | 24 | ACttagtccAC |
d) the sequences for round n=3 are determined, e.g., the following 24 sequences (5'-3'), and synthesized one by one, one on each portion of the round n=2 mixture, in the same way as in the previous round. After synthesis, 24 round n=3 products are obtained; they are mixed in equal numbers of molecules and again divided into 24 equal portions in preparation for the round n=4 synthesis;
1 | ACatataatAC | 7 | ACcataatgAC | 13 | ACgaactatAC | 19 | ACtaacgagAC |
2 | ACaatcaggAC | 8 | ACcaggtacAC | 14 | ACgacgatcAC | 20 | ACtacgctaAC |
3 | ACacttgcaAC | 9 | ACcgaatgaAC | 15 | ACgcccgagAC | 21 | ACtcagatcAC |
4 | ACagaaggcAC | 10 | ACcgctaatAC | 16 | ACggagtaaAC | 22 | ACtcgagtcAC |
5 | ACagctcagAC | 11 | ACccaggttAC | 17 | ACgttacacAC | 23 | ACtgtcaacAC |
6 | ACatccgtcAC | 12 | ACctcaggcAC | 18 | ACgcatagcAC | 24 | ACttcaaggAC |
e) the sequences for round n=4 are determined, e.g., the following 24 sequences (5'-3'), and synthesized one by one, one on each portion of the round n=3 mixture, in the same way as in the previous round. After synthesis, 24 round n=4 products are obtained; they are mixed in equal numbers of molecules in preparation for synthesis of the universal sequence;
1 | ACatacggaAC | 7 | ACcattgacAC | 13 | ACgaagagaAC | 19 | ACtaagcgcAC |
2 | ACaatggccAC | 8 | ACctaagtaAC | 14 | ACgagaggaAC | 20 | ACtagatgcAC |
3 | ACactcctgAC | 9 | ACcgatacgAC | 15 | ACgccgcgcAC | 21 | ACtctaggaAC |
4 | ACagatcctAC | 10 | ACcgccggaAC | 16 | ACggtaaccAC | 22 | ACtcgtcatAC |
5 | ACagcctgcAC | 11 | ACcctacggAC | 17 | ACgttcacaAC | 23 | ACtgcactgAC |
6 | ACatcgcatAC | 12 | ACctctcctAC | 18 | ACgcacgctAC | 24 | ACttctgccAC |
f) on the round n=4 mixture, the universal sequence tgt aaa acg acg gcc agt aca is further synthesized, giving a mixture of specific primers carrying the molecular tag (E2+B7+F2)4 and the universal sequence. The mixture contains 24 × 24 × 24 × 24 = 331,776 molecular tags; the tag sequences are known, the tags are present in a 1:1 ratio, and they are self-correcting. The final FP sequence with molecular tag is: 5'-tgtaaaacgacggccagtaca (N44) GGACCCCCACACAGCAAA-3';
g) increasing n gives longer molecular tag sequences and more molecular tags; for example, with n = 5 the number of molecular tags is 24 × 24 × 24 × 24 × 24 = 7,962,624. Alternatively, n can be kept at 4 and the number of (E2+B7+F2) species per round increased, for example to 36, giving 36 × 36 × 36 × 36 = 1,679,616.
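The library sizes quoted in steps f) and g) follow from multiplying the number of (E2+B7+F2) species used in each split-and-pool round. A small Python sketch makes the arithmetic explicit; the function names are hypothetical, and the default E2 and F2 lengths of 2 reflect the AC flanks in the tables of this example.

```python
import math

def library_size(species_per_round):
    """Number of distinct tags after split-and-pool synthesis:
    the product of the species counts used in each round."""
    return math.prod(species_per_round)

def tag_length(rounds, e2_len=2, f2_len=2):
    """Total tag length in bases for units of E2 + 7-base B7 + F2."""
    return rounds * (e2_len + 7 + f2_len)

print(library_size([24] * 4))  # four rounds of 24 species
print(library_size([24] * 5))  # five rounds of 24 species
print(library_size([36] * 4))  # four rounds of 36 species
print(tag_length(4))           # corresponds to the N44 stretch in step f)
```

Only the number of rounds and the species per round enter the count, so the number of oligonucleotides actually synthesized per round stays small (24 or 36 here) while the library grows geometrically.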
2. The specific sequence RP, 5'-AAG TTA AAA TTC CCG TCG CTA TCA A-3', and the UNITag sequence, 5'-tgt aaa acg acg gcc agt aca-3', are synthesized. The FP sequence carrying the 331,776 molecular tags synthesized above (UMI-FP) and the RP sequence are mixed according to the following system and amplified by PCR;
a) Preparation of the 5× Oligo Mix
Component | Primer concentration (μM) | Volume (μL) |
UMI-FP | 100 | 20 |
RP | 100 | 20 |
0.1×TE | | Make up to 1000 μL |
Total | | 1000 μL |
b) Preparation of the PCR system
Reagent | Amount |
5×Oligo Mix | 6 μL |
2×Taq Master Mix | 15 μL |
Sonicated genomic DNA template | 10 ng/30 ng/100 ng (3 DNA inputs, each in triplicate) |
Nuclease-Free Water | Make up to 30 μL |
c) UMI-PCR amplification procedure
After the PCR was completed, 1 unit of exonuclease I was added to each reaction, which was incubated at 37 ℃ for 30 minutes and then inactivated at 80 ℃ for 30 minutes. An additional 2 μL of 10 μM RP and 2 μL of 10 μM UNITag were added for the subsequent PCR amplification procedure.
d) PCR amplification procedure
e) The 9 tubes of amplification products from the three inputs (10 ng/30 ng/100 ng) were used for library construction with a commercial Illumina library construction kit and sequenced, and the diversity of the molecular tags was then analyzed. A molecular tag was counted only if it had more than 6 reads. The statistics are shown in the following table. According to the analysis, the molecular tag library prepared in this example rescues or corrects about 10% of the effective data on average, which is a marked effect.
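The read-count rule in step e) can be sketched as a simple tally. The Python snippet below is illustrative only: it assumes the tag of every read has already been extracted as a string, and the name `count_supported_tags` and the toy tags are hypothetical.

```python
from collections import Counter

def count_supported_tags(tag_per_read, min_reads=7):
    """Count distinct tags seen in more than 6 reads (at least min_reads)."""
    counts = Counter(tag_per_read)
    return sum(1 for n in counts.values() if n >= min_reads)

# Toy input: one tag string per sequenced read.
reads = ["AAC"] * 7 + ["GGT"] * 6 + ["TTA"] * 10
print(count_supported_tags(reads))
```

In a fuller pipeline, tags would presumably be passed through the B7 correction step before tallying, so that reads differing from a valid tag by a single base are counted toward that tag.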
Example 4: Method for constructing an (E2+B7+F2)n molecular tag library by ligation
1. 30 (E2+B7+F2)2 sequences with a CG% of 50% were selected, wherein E2 and F2 are 0 bases, as shown in the following table:
ID | Seq | ID | Seq |
HMB401 | aacggttaagacgg | HMB416 | agtctagatccgtc |
HMB402 | aacttggaatggcc | HMB417 | atacggaatggcga |
HMB403 | aagggaaacaccca | HMB418 | atcgcatattcgcg |
HMB404 | aagttccaccgtgt | HMB419 | atctacgcaaccac |
HMB405 | aatcaggacctgtg | HMB420 | atgatcgcacgttg |
HMB406 | acagttgacggtca | HMB421 | atgcgatcactggt |
HMB407 | acatggtacgtgac | HMB422 | attgctccaggtac |
HMB408 | acgaatgactcctg | HMB423 | caagtgtcagtgca |
HMB409 | actgtacagaaggc | HMB424 | caatgtgcatccgt |
HMB410 | acttgcaagcgact | HMB425 | cacaaacccaggtt |
HMB411 | agagaagagctcag | HMB426 | cagaagtccattgg |
HMB412 | agatcctaggagag | HMB427 | catgtcaccgactt |
HMB413 | agcagtaaggctct | HMB428 | cattgaccctggaa |
HMB414 | agggataagtgagc | HMB429 | ccaacaacctttcc |
HMB415 | agtagctatagccg | HMB430 | ccgttaacgagcat |
The sequences PO3-aaccaccaccaaca + HMB# + accaacaaaccacc, 30 in total, were synthesized, mixed in equal numbers of molecules for later use, and designated UMIseq30.
2. 500 (E2+B7+F2)2 sequences were selected, wherein E2 and F2 are 0 bases, as shown in the table below.
The sequences AGACGTGTGCTCTTCCGATCTATCA + HMB# + aaccaccaccaaca, 500 in total, were synthesized, mixed in equal numbers of molecules for later use, and designated UMIseq500.
3. The following sequences were synthesized:
Sequence name | Sequence (5'-3') |
Primer-1st stem complement | ggtggtttgttggtggtggtttgttggt |
1st-2nd stem complement | tgttggtggtggtttgttggtggtggtt |
Wherein the last base T at the 3' end is a dideoxy-modified ddT.
4. The specific primer sequences in the following table were synthesized
Wherein the 5' ends of all sequences in the table are modified with a PO3 phosphate group.
5. Ligation step
Each specific primer, the primer-1st stem complement, the 1st-2nd stem complement, UMIseq30 and UMIseq500 were each adjusted to a concentration of 2 μM, calculated on the total amount of molecules, and mixed at a volume ratio of 1:2:2:2:2. They were then ligated with a commercial ligase kit following the manufacturer's recommended procedure, giving 37 species of primer sequences carrying molecular tags, with 30 × 500 = 15,000 tag species for each primer. It should be noted that, by the principle of the Poisson distribution, 15,000 molecular tags are not sufficient to label the roughly 3,000 copies of a molecule present in 10 ng, and may not give each of the 3,000 molecules a unique tag. For mutant molecules present at a low proportion, say 1%, however, there are only 30 molecules, and 15,000 sequences are sufficient to give each of those 30 mutant molecules a unique tag. This example is therefore better suited to tumor detection or to detecting target molecules present at low proportions. Expanding the number of molecular tag species in this example is also simple: the number of species of the 1st stem sequence and of the 2nd stem sequence can be increased, e.g., to 40 species and 1000 species respectively, in which case the number of molecular tag species finally obtained is 40,000.
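The contrast drawn above between 3,000 and 30 molecules can be illustrated numerically. The sketch below assumes that each molecule receives one tag drawn uniformly at random from the library, and computes the probability that a given molecule shares its tag with no other molecule. This simplified model and the name `prob_unique` are assumptions, not part of the original example.

```python
def prob_unique(num_molecules, num_tags):
    """Probability that a given molecule's tag is carried by no other molecule,
    with each molecule drawing one of num_tags tags uniformly at random."""
    return (1.0 - 1.0 / num_tags) ** (num_molecules - 1)

# 30 x 500 = 15,000 tags, as in this example.
print(round(prob_unique(3000, 15000), 4))   # all copies in the input
print(round(prob_unique(30, 15000), 4))     # 1% mutant molecules only
print(round(prob_unique(3000, 40000), 4))   # enlarged 40 x 1000 library
```

Under this model roughly one molecule in five shares its tag when 3,000 molecules draw from 15,000 tags, whereas collisions among 30 molecules are rare, which matches the qualitative argument of the text.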
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. It should be noted that those skilled in the art can make many modifications and variations without departing from the technical principle of the present invention, and such modifications and variations should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A method for preparing a molecular tag library for sequencing is characterized by comprising the following steps:
S1, designing a molecular tag B7 sequence, wherein the B7 sequence is designed according to the following method:
defining the 7-base sequence of the B7 sequence as (a b c d x y z); wherein,
a, b, c and d are information bits and represent the digital sequence converted from a randomly formed 4-base sequence consisting of the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained by converting a, b, c and d according to the following formula:
wherein floor is the round-down (floor) function;
S2, combining a plurality of different molecular tag sequences with the specific sequence to obtain specific sequences containing a molecular tag library; the molecular tag sequence consists of n (E2+B7+F2) units, wherein E2 is 0-5 bases; F2 is 0-5 bases; and n is any integer from 1 to 20.
2. The method of claim 1, wherein the CG% of the n (E2+B7+F2) units is between 35% and 75%.
3. The method of claim 1, wherein the ratio of the number of molecular tag sequences in the molecular tag library for sequencing to the number of target molecules is greater than 10:1.
4. The method of claim 1, wherein the step of binding the molecular tag sequence to the specific sequence comprises the steps of:
S01, dividing the synthesized specific sequence into separate portions, and synthesizing the sequences of the n=1 unit one by one, one on each portion of the specific sequence;
S02, mixing the sequences synthesized in S01, dividing the mixture into separate portions again, synthesizing the sequences of the n=2 unit one by one on the portions, and repeating in the same way until specific sequences containing the molecular tag library are obtained in the number of molecular tag sequences required.
5. The method of claim 1, wherein the specific sequence is a PCR amplification primer, a hybridization probe, an isothermal extension primer, or a ligation primer.
7. An error correction method for a molecular tag library, characterized by comprising the following steps:
S001, setting temporary values temp1, temp2 and temp3: temp1 = a+b+d+x; temp2 = b+c+d+y; temp3 = a+c+d+z;
wherein a, b, c, d, x, y and z represent the 7 bases of the molecular tag B7 sequence,
a, b, c and d are information bits and represent the digital sequence converted from a randomly formed 4-base sequence consisting of the bases A, T, G and C, the bases being converted into digits as follows: A is 1, T is 2, G is 3, C is 4;
x, y and z are check bits and are obtained by converting a, b, c and d according to the following formula:
wherein floor is the round-down (floor) function;
S002, evaluating whether any of the information bits a, b, c and d is in error according to the values of temp1, temp2 and temp3;
S003, if an error has occurred, completing the self-check, replacing the erroneous information bit with the correct one, converting the result into a base sequence, and outputting it.
8. The method for correcting errors in molecular tag library according to claim 7, wherein the step S002 comprises the steps of evaluating whether the information bits a, b, c and d have errors according to the values of temp1, temp2 and temp 3:
if temp1-4 floor (temp1/4) is not equal to 2 and temp2-4 floor (temp2/4) is not equal to 2, an error occurs at position b; the correct value for b is calculated at this time:
b1=14-a-d-x-4*floor((14-a-d-x-1)/4),b=b1
b is replaced by b 1;
and set b2 ═ 14-c-d-y-4 flow ((14-c-d-y-1)/4)
If B1 ≠ B2, it indicates that two or more information bit errors occur, the decoding output cannot be completed, and the correction process of the current B7 sequence exits;
if temp2-4 floor (temp2/4) is not equal to 2 and temp3-4 floor (temp3/4) is not equal to 2, an error occurs at position c; the correct value for c is calculated at this time:
c1=14-a-d-z-4*floor((14-a-d-z-1)/4),c=c1
c is replaced by c 1;
and c2 ═ 14-b-d-y-4 flow ((14-b-d-y-1)/4)
If c1 ≠ c2, it indicates that two or more information bit errors occur, the decoding output cannot be completed, and the correction process of the current B7 sequence exits;
if temp1-4 floor (temp1/4) is not equal to 2 and temp3-4 floor (temp3/4) is not equal to 2, an error occurs at position a; the correct value for a is calculated at this time:
a1=14-b-d-x-4*floor((14-b-d-x-1)/4),a=a1
replacing a with a 1;
and the setting a2 ═ 14-c-d-z-4 ═ floor ((14-c-d-z-1)/4)
If a1 ≠ a2, it indicates that two or more information bit errors occur, the decoding output cannot be completed, and the correction process of the current B7 sequence exits;
if temp1-4 floor (temp1/4) is not equal to 2, temp2-4 floor (temp2/4) is not equal to 2, and temp3-4 floor (temp3/4) is not equal to 2, an error occurs at position d; at this time, the correct value of d is calculated
d1=14-a-b-x-4*floor((14-a-b-x-1)/4),d=d1
D is replaced by d 1;
and d2 is set to 14-b-c-y-4 floor ((14-b-c-y-1)/4);
d3=14-a-c-z-4*floor((14-a-c-z-1)/4);
if d1 ≠ d2 ≠ d3, it is said that two or more information bit errors occur, the decoding output cannot be completed, and the correction process of the current B7 sequence exits;
where floor is a floor rounding function.
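The decision rules of this claim can be sketched in Python as follows. This is only an illustrative reading, not the patentee's implementation, and it rests on assumptions that are not stated in this claim: each base is represented as an integer from 1 to 4, and temp1 = a+b+d+x, temp2 = b+c+d+y, temp3 = a+c+d+z (inferred from the correction formulas, since each formula makes the corresponding sum leave remainder 2 when divided by 4). The function names `check`, `encode` and `correct` are hypothetical. The sketch tests the three-failure case (position d) first so that it is not mistaken for a two-failure case.

```python
def check(s):
    """Return the value in 1..4 given by 14-s-4*floor((14-s-1)/4).

    Adding this value to s always yields a sum whose remainder mod 4 is 2.
    """
    return 14 - s - 4 * ((14 - s - 1) // 4)


def encode(a, b, c, d):
    """Append check values x, y, z to information values a, b, c, d (assumed layout)."""
    x = check(a + b + d)
    y = check(b + c + d)
    z = check(a + c + d)
    return (a, b, c, d, x, y, z)


def correct(word):
    """Return corrected (a, b, c, d), or None when the cross-check disagrees."""
    a, b, c, d, x, y, z = word
    # temp - 4*floor(temp/4) is simply temp mod 4; a valid sum gives 2.
    bad1 = (a + b + d + x) % 4 != 2   # temp1 (assumed definition)
    bad2 = (b + c + d + y) % 4 != 2   # temp2 (assumed definition)
    bad3 = (a + c + d + z) % 4 != 2   # temp3 (assumed definition)

    if bad1 and bad2 and bad3:        # error at position d
        d1, d2, d3 = check(a + b + x), check(b + c + y), check(a + c + z)
        if not (d1 == d2 == d3):
            return None               # two or more errors: give up
        d = d1
    elif bad1 and bad2:               # error at position b
        b1, b2 = check(a + d + x), check(c + d + y)
        if b1 != b2:
            return None
        b = b1
    elif bad2 and bad3:               # error at position c
        c1, c2 = check(a + d + z), check(b + d + y)
        if c1 != c2:
            return None
        c = c1
    elif bad1 and bad3:               # error at position a
        a1, a2 = check(b + d + x), check(c + d + z)
        if a1 != a2:
            return None
        a = a1
    # Zero or one failing sum is outside this claim; the information
    # values are returned unchanged in that case.
    return (a, b, c, d)


word = encode(1, 2, 3, 4)             # (1, 2, 3, 4, 3, 1, 2)
damaged = (1, 4, 3, 4, 3, 1, 2)       # position b changed from 2 to 4
print(correct(damaged))               # (1, 2, 3, 4)
```

Under these assumptions, any single substitution in a, b, c or d changes exactly the sums that contain that position, which is why the pattern of failing sums identifies the position and the two (or three) recomputed values agree.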
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011540460.5A CN112466405B (en) | 2020-12-23 | 2020-12-23 | Method for preparing molecular tag library for sequencing |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112466405A (en) | 2021-03-09 |
CN112466405B (en) | 2021-06-22 |
Family
ID=74803373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011540460.5A Active CN112466405B (en) | 2020-12-23 | 2020-12-23 | Method for preparing molecular tag library for sequencing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112466405B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114774516B (en) * | 2022-03-28 | 2024-04-12 | 深圳裕康医学检验实验室 | UMI sequence design method for correcting sequencing errors and application thereof |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004083819A2 (en) * | 2003-03-17 | 2004-09-30 | Trace Genetics, Inc | Molecular forensic specimen marker |
CN105119717A (en) * | 2015-07-21 | 2015-12-02 | 郑州轻工业学院 | DNA coding based encryption system and encryption method |
CN106086162A (en) * | 2015-11-09 | 2016-11-09 | 厦门艾德生物医药科技股份有限公司 | Double-tag adapter sequence for detecting tumor mutations, and detection method |
CN107365861A (en) * | 2017-08-28 | 2017-11-21 | 华中农业大学 | Molecular marker for identifying dark-brown cotton |
CN108932401A (en) * | 2018-06-07 | 2018-12-04 | 江西海普洛斯生物科技有限公司 | Sequencing sample identification method and application thereof |
CN109337966A (en) * | 2017-08-01 | 2019-02-15 | 上海禀远生物科技有限公司 | Molecular tag, and reagent and application thereof |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2001062935A1 (en) * | 2000-02-25 | 2001-08-30 | Villoo Morawala Patell | A process for constructing dna based molecular marker for enabling selection of drought and diseases resistant germplasm screening |
US7155453B2 (en) * | 2002-05-22 | 2006-12-26 | Agilent Technologies, Inc. | Biotechnology information naming system |
EP2619327B1 (en) * | 2010-09-21 | 2014-10-22 | Population Genetics Technologies LTD. | Increasing confidence of allele calls with molecular counting |
CN109326322B (en) * | 2018-08-17 | 2020-12-08 | 华中科技大学 | Method and system for comparing QTL (quantitative trait loci) among different segregation groups of crops |
Non-Patent Citations (4)
Title |
---|
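Advances in molecular marker techniques and their applications in plant sciences; Milee Agarwal et al.; Plant Cell Rep; 20080202; pp. 617-631 * |
ATGC transcriptomics: a web-based application to integrate, explore and analyze de novo transcriptomic data; Sergio Gonzalez et al.; BMC Bioinformatics; 20170222; pp. 1-9 * |
A genetic sequence analysis method based on a convolutional code model; Liu Xiao et al.; Journal of Northwest A&F University (Natural Science Edition); 20100410; Vol. 38 (No. 4); pp. 207-214 * |
Gene sequencing "black technology" gives life a complete "digital interpretation"; Zhang Jiaxing; Science and Technology Daily; 20190625; pp. 1-2 * |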
Also Published As
Publication number | Publication date |
---|---|
CN112466405A (en) | 2021-03-09 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN110945595B (en) | DNA-based data storage and retrieval | |
Faircloth et al. | Not all sequence tags are created equal: designing and validating sequence identification tags robust to indels | |
Hasegawa et al. | Heterogeneity of tempo and mode of mitochondrial DNA evolution among mammalian orders | |
Organick et al. | Scaling up DNA data storage and random access retrieval | |
Tippery et al. | Evaluation of phylogenetic relationships in Lemnaceae using nuclear ribosomal data | |
CN112466405B (en) | Method for preparing molecular tag library for sequencing | |
Michel et al. | Bijective transformation circular codes and nucleotide exchanging RNA transcription | |
AU2015286672B2 (en) | Methods and products for quantifying RNA transcript variants | |
JP6664575B2 (en) | Nucleic acid molecule counting method | |
CN107155361A (en) | Code generating method, code generating unit and computer-readable recording medium | |
Sierro et al. | Whole genome profiling physical map and ancestral annotation of tobacco Hicks Broadleaf | |
CN112382340A (en) | Coding and decoding method and coding and decoding device for binary information to base sequence for DNA data storage | |
CN109797438A (en) | A kind of joint component and library constructing method quantifying sequencing library building for the variable region 16S rDNA | |
CN101845500B (en) | Method for correcting sequence abundance deviation of secondary high-flux sequence test by DNA sequence bar codes | |
CN112749247A (en) | Text information storage and reading method and device | |
WO2019204702A1 (en) | Error-correcting dna barcodes | |
CN109415768B (en) | Variable region sequence library construction method, sequencing method and kit thereof | |
KR101969905B1 (en) | Primer set for library of base sequencing and manufacturing method of the library | |
CN104560982A (en) | Artificial exogenous reference molecule for type and abundance comparison between different species of microorganisms | |
CN109136217A (en) | A kind of method of sequencing library building builds library reagent and its application | |
CN108504651A (en) | The library constructing method and reagent in library are built in PCR product large sample size mixing based on high-flux sequence | |
CN108707653B (en) | Kit for constructing variable region sequence library and sequencing method of variable region sequence | |
CN113122616A (en) | Method for amplifying and determining target nucleotide sequence | |
KR101953663B1 (en) | Method for generating pool containing oligonucleotides from a oligonucleotide | |
US20210202032A1 (en) | Method of tagging nucleic acid sequences, composition and use thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||