CN113736777A - Design and synthesis method of nucleic acid coding probe for high-throughput sequencing - Google Patents


Publication number
CN113736777A
Authority
CN
China
Prior art keywords
sequence
nucleic acid
probe
coding
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111073314.0A
Other languages
Chinese (zh)
Inventor
刘鹏
吴俣帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111073314.0A priority Critical patent/CN113736777A/en
Publication of CN113736777A publication Critical patent/CN113736777A/en
Pending legal-status Critical Current

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12NMICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09Recombinant DNA-technology
    • C12N15/11DNA or RNA fragments; Modified forms thereof; Non-coding nucleic acids having a biological activity
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869Methods for sequencing

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Organic Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • Microbiology (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Plant Pathology (AREA)
  • Analytical Chemistry (AREA)
  • Immunology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a method for designing and synthesizing nucleic acid coding probes used in high-throughput sequencing library construction. The invention provides a nucleic acid coding probe set consisting of a plurality of nucleic acid coding probes. Each nucleic acid coding probe comprises a sequencing adapter sequence, a sample coding system, a Read2 sequence, a UMI sequence, and a sample capture sequence. The sample coding system of each probe is composed of a sample tag sequence and a micro-pit coding sequence, and the sample coding systems of the individual probes differ from one another. The sample tag sequence and the micro-pit coding sequence are separated by the Read2 sequence; the micro-pit coding sequence is longer than the sample tag sequence; and different micro-pit coding sequences differ from one another at more than 2 bases. This coding probe design greatly increases coding throughput and supports simultaneous sequencing of more samples.

Description

Design and synthesis method of nucleic acid coding probe for high-throughput sequencing
Technical Field
The invention belongs to the technical field of biology, and relates to a method for designing and synthesizing a nucleic acid coding probe, in particular to a method for designing and synthesizing a nucleic acid coding probe used in a high-throughput sequencing library construction process.
Background
High-throughput transcriptome sequencing has significant technical advantages. The transcriptome is the collection of all transcripts in a cell, whose abundance is generally determined by the developmental stage and physiological condition of the cell. Transcriptome sequencing helps researchers understand gene function, the mechanisms of signaling pathways, the molecular composition of cells or tissues, and the pathogenesis of disease more fully. Compared with other transcriptome research methods (e.g., hybridization assays), transcriptome sequencing has the following advantages: 1. Wide detection range. The method is not limited to known target genes and can detect and discover unknown genes and non-coding RNAs. 2. High resolution. Single bases can be resolved, which gives the method a clear advantage in research on allele-specific expression, alternative splicing, single nucleotide polymorphisms (SNPs), and similar topics. 3. Digital readout. The reads obtained by sequencing are a digital signal, so more accurate counts can be obtained, particularly for low-abundance genes.
High-throughput transcriptome sequencing is widely applied. Its main applications fall into four areas: 1. gene expression level analysis and differential expression analysis; 2. discovery of novel genes; 3. gene structure analysis and functional annotation; 4. single nucleotide polymorphism analysis. Each of these plays an important role in helping researchers understand the intrinsic mechanisms of biological processes and the development of organisms. In addition, as high-throughput transcriptome sequencing has been refined, more precise and advanced sequencing technologies have emerged in rapid succession. For example, single-cell transcriptome sequencing studies transcriptome differences between cells at the single-cell level, removes the confounding effect of cell heterogeneity present in traditional methods, and opens new perspectives for basic biology, developmental biology, neurobiology, and related fields.
Nucleic acid coding probes are a key factor in increasing the sample throughput of transcriptome sequencing. As sequencing technology continues to improve, the throughput of a single sequencing run keeps rising, so that sequencing itself is, in a sense, no longer the main limiting factor. The more critical question is how to obtain transcriptional information from as many samples as possible in one sequencing run. This requires each sample to be encoded precisely during sequencing library construction. In addition, researchers want absolute transcript counts and more accurate quantification of the resulting transcriptional information; this can be achieved by adding a unique molecular identifier (UMI) sequence to the coding probes. The design and synthesis quality of nucleic acid probes are therefore of critical importance. A well-designed, high-performing set of nucleic acid probes not only improves the sensitivity and precision of sequencing but also greatly increases sample throughput, reducing experimental cost in terms of labor, time, and materials and improving sequencing efficiency.
The challenge of coding probe design is mainly twofold: the specific design of each functional sequence in the probe, and the way the functional sequences are combined. These two challenges can even conflict. For example, a longer sample coding sequence allows more samples to be encoded, but it inevitably makes combining the sample coding sequence with the other functional sequences more complex. Coding probes therefore cannot be designed by simply arranging and concatenating sequences linearly; they must be designed globally and iteratively, after fully weighing the role, requirements, and interactions of each functional sequence.
Specifically, the functional sequences of a coding probe include adapter sequences, sample coding sequences, unique molecular identifier sequences, capture sequences, and other auxiliary sequences. Among these, the sample coding sequence is one of the most important, and its design has been studied by many researchers. In single-cell transcriptome sequencing in particular, the required sample coding capacity is on the order of one hundred thousand or even more than a million. Several high-throughput single-cell sequencing methods, including Drop-seq and Microwell-seq, have each designed a set of sample coding sequences with a capacity in the millions, which supports the encoding and sequencing of very large numbers of single cells. However, existing sample coding sequences have two problems. First, the excessive pursuit of throughput leaves too small a Hamming distance between sample coding sequences, which makes code recognition errors likely. Second, the coding sequences serve only to distinguish cells, and a particular coding sequence is difficult to map back to a specific cell, so downstream cellular analysis lacks phenotypic information. These two problems restrict the further development of transcriptome sequencing and ultimately stem from the design of the coding probes.
The difficulty of synthesizing coding probes stems mainly from their complexity. In transcriptome sequencing in particular, coding probes present three major difficulties: long sequences, many types, and complex terminal modifications. As a result, conventional probe synthesis methods are costly and inefficient and cannot meet requirements.
Nucleic acid coding probes contain many long functional sequences and are difficult to synthesize. The widely used solid-phase phosphoramidite synthesis adds one base at a time by repeated cycles of detritylation, activation, coupling, capping, and oxidation. Each incoming base carries a protecting group that prevents further extension until it is removed after the addition is complete. The process is complicated, and the yield of full-length product falls rapidly as the number of cycles increases. When a 60-base DNA is synthesized, the success rate is only 74%; beyond 200 bases, the theoretical success rate is no higher than 37%. In practice, high-purity long sequences can hardly be obtained because of various uncontrollable factors.
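The quoted success rates are consistent with a simple model in which every coupling cycle succeeds independently with the same probability. The sketch below assumes a per-cycle coupling efficiency of 99.5%; this value is not stated in the source and is chosen only because it reproduces the quoted figures. The function name is illustrative.

```python
def full_length_yield(n_bases: int, coupling_efficiency: float = 0.995) -> float:
    """Fraction of full-length product after n_bases coupling cycles.

    Assumption (not from the source): each cycle succeeds independently
    with the same probability, here 99.5% by default.
    """
    return coupling_efficiency ** n_bases


for n in (60, 114, 200):
    print(f"{n} nt: {full_length_yield(n):.1%}")
```

Under this assumption the yield decays geometrically with length, which is why splitting a long probe into two shorter oligonucleotides can help.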
The core function of a coding probe is encoding, which means that as many coding probes must be synthesized as there are codes required. As the demand for coding capacity grows, the traditional probe synthesis method therefore adds substantial synthesis and trial-and-error costs. The increased number of probes also makes accurate decoding more challenging. A more efficient coding design and synthesis method would greatly benefit cost, encoding, and decoding.
In general, coding probes require specific modifications in different functional sequence portions, including biotin modification, phosphorylation, TEG, thio modification, fluorescein modification, and even some structural modifications such as U-type. Modifying probe sequences is complex and costly; because coding probes are long and of many types, modifying long sequences across many different probes is especially difficult. Under these conditions, the traditional synthesis method not only greatly increases cost but also fails to yield high-quality probe sequences.
Therefore, there is a need for a new method for designing and synthesizing nucleic acid encoding probes.
Disclosure of Invention
It is an object of the present invention to provide a nucleic acid encoding probe set.
The nucleic acid coding probe set provided by the invention consists of m × n nucleic acid coding probes, where m and n are integers greater than or equal to 2;
each nucleic acid coding probe comprises a sequencing adapter, a sample tag sequence (primary code), a Read2 sequence, a micro-pit coding sequence (secondary code), a UMI sequence, and a sample capture sequence; the sample tag sequence and the micro-pit coding sequence together form the sample coding system of the probe;
the sample coding system of each nucleic acid coding probe is different;
the probe set contains m sample tag sequences and n micro-pit coding sequences, so that the number of sample coding systems in the probe set is m × n; m is less than n;
in each probe, the sample tag sequence and the micro-pit coding sequence lie on opposite sides of the Read2 sequence;
in each probe, the micro-pit coding sequence is longer than the sample tag sequence;
any two of the n micro-pit coding sequences differ from each other at more than 2 bases.
In the above nucleic acid coding probe set, the sample tag sequence consists of 6 random bases, and each sample tag sequence contains no run of more than 2 consecutive identical bases;
the micro-pit coding sequence consists of 10-20 random bases, different micro-pit coding sequences differ from each other at more than 2 bases, and each micro-pit coding sequence contains no run of more than 2 consecutive identical bases;
the UMI sequence consists of 10 random bases;
the sample capture sequence consists of Poly-T, V, and N, where V is any one of the three bases other than T, and N is a random base;
a random base is A, T, C, or G.
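The two sequence constraints stated above (no run of more than 2 identical bases, and more than 2 differing bases between any two codes) can be checked mechanically. The following is a minimal sketch with illustrative function names. It is applied to the four sample tags given in this document only because they are the concrete sequences available here; the source states the pairwise-difference rule for the micro-pit codes.

```python
from itertools import combinations


def max_run(seq: str) -> int:
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def valid_code_set(codes, max_repeat=2, min_distance=3):
    """True if every code obeys the run limit and every pair differs enough."""
    runs_ok = all(max_run(c) <= max_repeat for c in codes)
    dist_ok = all(hamming(a, b) >= min_distance for a, b in combinations(codes, 2))
    return runs_ok and dist_ok


sample_tags = ["CGTGAT", "ACATCG", "GCCTAA", "TGGTCA"]
print(valid_code_set(sample_tags))
```

Here `min_distance=3` encodes "more than 2 differing bases"; a stricter threshold can be passed when a larger Hamming distance is wanted.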
In the above nucleic acid coding probe set, a spacer sequence is further attached upstream of the sequencing adapter; the spacer sequence consists of 5-10 T bases, and its first base is biotin-modified (so that the 5' end of the probe carries biotin).
In the above nucleic acid coding probe set, the nucleic acid coding probe comprises, in order from the 5' end: the spacer sequence, the sequencing adapter, the sample tag sequence, the Read2 sequence, the micro-pit coding sequence, the UMI sequence, the Poly-T, the V, and the N.
In the above nucleic acid coding probe set, the nucleotide sequence of the sequencing adapter is sequence 1 in the sequence listing;
the sample tag sequence is CGTGAT, ACATCG, GCCTAA or TGGTCA;
the nucleotide sequence of the Read2 sequence is sequence 2 in the sequence listing.
In an embodiment of the invention, the nucleic acid coding probes in the above probe set belong to any one of the following groups:
1) nucleic acid coding probes formed by ligating P1-A to each of P2-1 to P2-96;
2) nucleic acid coding probes formed by ligating P1-B to each of P2-1 to P2-96;
3) nucleic acid coding probes formed by ligating P1-C to each of P2-1 to P2-96;
4) nucleic acid coding probes formed by ligating P1-D to each of P2-1 to P2-96;
and in each nucleic acid coding probe, the last base of the P1-A, P1-B, P1-C, or P1-D nucleotide sequence is immediately adjacent to the first base of the nucleotide sequence of the corresponding one of P2-1 to P2-96.
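The four groups above amount to a cross-combination of 4 P1 oligonucleotides with 96 P2 oligonucleotides. A minimal sketch of that enumeration follows; it uses only the names P1-A to P1-D and P2-1 to P2-96 from the text, and the variable names are illustrative.

```python
from itertools import product

p1_names = [f"P1-{letter}" for letter in "ABCD"]   # m = 4 primary parts
p2_names = [f"P2-{i}" for i in range(1, 97)]       # n = 96 secondary parts

# Every probe is one (P1, P2) pair, so the set holds m x n probes.
probe_set = list(product(p1_names, p2_names))

print(len(probe_set), "probes from", len(p1_names) + len(p2_names), "oligonucleotides")
```

The point of the design is visible in the counts: 384 distinct probes are obtained from 100 separately synthesized parts.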
Another object of the present invention is to provide a method for synthesizing the above nucleic acid-coding probe.
The method provided by the invention comprises the following steps:
1) designing the nucleic acid coding probes of the above probe set, and splitting each probe between any two adjacent bases within Read2 (choosing the split so that the resulting Read2-1 and Read2-2 do not differ greatly in length), where the part nearer the 5' end of the probe is named P1 and the remaining part is named P2;
2) synthesizing, for each probe, P1, the corresponding P2, and a linker;
the 5' end of P2 is phosphorylated;
the 5' end of P1 is biotin-labeled (i.e., the first base of the spacer sequence is biotin-modified), so that the probe can be bound to streptavidin-modified magnetic beads for capturing the sample to be tested;
the linker is reverse complementary to Read2 in the probe;
3) ligating the P1 of each probe, its corresponding P2, and the linker under the action of T4 ligase to obtain a ligation product, which is the nucleic acid coding probe.
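Step 1) can be illustrated with a small helper that cuts a probe inside its Read2 region. This is a sketch, not the patent's procedure: the function name and the `cut` parameter are hypothetical, and the example probe is truncated after the micro-pit code (UMI, Poly-T, V and N are left out). The sequences are the ones given in this document.

```python
def split_probe(probe: str, read2: str, cut: int):
    """Split a probe inside its Read2 region and return (P1, P2).

    cut is the number of Read2 bases that stay on P1 (Read2-1);
    the remaining Read2 bases (Read2-2) begin P2.
    """
    start = probe.find(read2)
    if start < 0:
        raise ValueError("Read2 not found in probe")
    if not 0 < cut < len(read2):
        raise ValueError("cut must fall strictly inside Read2")
    return probe[: start + cut], probe[start + cut:]


READ2 = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"
# Spacer + P7 adapter + sample tag + Read2 + micro-pit code (truncated example).
probe = "TTTTT" + "CAAGCAGAAGACGGCATACGAGAT" + "CGTGAT" + READ2 + "CGACACGGTTTGGGCC"

p1, p2 = split_probe(probe, READ2, cut=18)
print(p1)
print(p2)
```

With `cut=18`, P1 ends in an 18-base Read2-1 and P2 starts with a 16-base Read2-2, so the two halves of Read2 are of similar length.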
In step 3) of the method, the molar amount of P1 in the ligation system is greater than that of P2;
or, in step 3), the molar amount of P1 in the ligation system is less than that of P2.
In an embodiment of the invention, the molar ratio of P1 to P2 in the above method is 3:1 or 1:3.
If the molar amount of P1 in step 3) is less than that of P2, the method further comprises purifying the ligation product.
The use of the probe or the probe prepared by the method for capturing a target fragment;
or, the use of the probe or the probe prepared by the method in high throughput sequencing;
or, the application of the probe or the probe prepared by the method in preparing a product for capturing target fragments;
or, the application of the probe or the probe prepared by the method in preparing a high-throughput sequencing product;
alternatively, the present invention provides a system for synthesizing the probe, comprising the above P1, the P2, the linker and the T4 ligase.
The invention aims to provide a design and synthesis method for a nucleic acid coding probe for high-throughput sequencing. On one hand, a coding probe sequence with better performance is obtained by optimizing the design of the probe; on the other hand, an innovative coded probe synthesis method is provided. Through the combination of the two parts, researchers can obtain a nucleic acid coding probe with more excellent performance and lower cost.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. The coding probe design of the invention greatly increases coding throughput and supports simultaneous sequencing of more samples.
2. The coding probe design of the invention expands the coding library, allows researchers to select coding probe sets with proper GC content, fewer repeated sequences and larger Hamming distance, and obtains more accurate and reliable sequencing results.
3. The coding probe design of the invention greatly reduces the coding cost, and the cost reduction effect is more obvious along with the increase of the coding quantity, so that the coding probe is not only suitable for small laboratories with insufficient budget, but also suitable for large laboratories needing higher flux.
4. The coding probe design makes decoding convenient, rapid, and accurate, reduces the burden of bioinformatics analysis, and is friendlier to the many researchers without a bioinformatics background.
5. The method for synthesizing the coded probe can synthesize the coded probe with long sequence, multiple types and complex terminal modification, and has extremely wide applicability.
Drawings
FIG. 1 is a schematic diagram of a coded probe structure.
FIG. 2 is a schematic diagram of nucleic acid coding probe synthesis using T4 DNA ligase.
FIG. 3 is a scheme for coding probe synthesis.
FIG. 4 is a spatial distribution plot of sequencing reads for each probe under the first synthesis scheme.
FIG. 5 is a spatial distribution plot of sequencing reads for each probe under the second synthesis scheme.
FIG. 6 is a statistical plot of sequencing reads for each probe under the second synthesis scheme.
FIG. 7 shows the alignment of sequencing data for each probe under the second synthesis scheme.
FIG. 8 shows a flow cytometry method for verifying the synthesis of coded probes (T4 ligation method).
FIG. 9 shows PCR gel electrophoresis of bacterial suspension.
Detailed Description
The experimental procedures used in the following examples are all conventional procedures unless otherwise specified.
Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 design of high throughput sequencing nucleic acid encoding probes
FIG. 1 shows a schematic structural diagram of the high-throughput sequencing nucleic acid coding probe, which comprises the following elements in order from the 5' end:
1) Sequencing adapter
The embodiments of the invention use Illumina next-generation sequencing, so the sequencing adapter is the P7 sequencing adapter (5'-CAAGCAGAAGACGGCATACGAGAT-3', sequence 1).
A spacer sequence is attached to the 5' end of the sequencing adapter (TTTTT in the embodiments of the invention; 5-10 Ts are generally suitable), and the first base T of the spacer sequence is biotin-modified at its 5' end. The spacer sequence provides space when the probe is bound to streptavidin-modified magnetic beads during capture, which facilitates primer binding in the subsequent PCR amplification.
2) Sample tag sequences
The sample tag sequence is the primary code of a sample. It consists of 6 random bases (A, T, C, or G; 6 bp in length); each sample tag within the same sequencing run must be different, and the GC content of each sample tag is adjusted to suit the sequencing platform.
In the embodiments of the invention, the sample tag sequences are exemplified by CGTGAT, ACATCG, GCCTAA, and TGGTCA.
The number of sample tag sequences does not need to equal the number of samples; the tags can be combined with the micro-pit codes described below to reach a total number of codes suited to the required throughput.
3) Read2 sequence
Read2 sequence: GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT (sequence 2).
The embodiments of the invention use Illumina sequencing, and the primer (auxiliary sequence) used for Read2 sequencing is AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC.
4) Micro-pit coding sequence
The micro-pit coding sequence is the secondary code of a sample and works together with the primary code to form a larger sample coding system with better coding performance.
The micro-pit coding sequence consists of 10-20 random bases (A, T, C, or G); different micro-pit codes differ from each other at more than 2 bases (to increase the Hamming distance between coding sequences), the GC content of each micro-pit code is about 50% (to suit Illumina next-generation sequencing), and each micro-pit code contains no run of more than 2 consecutive identical bases.
In the present example, the micro-pit code consists of 16 bases, and 96 different 16-bp sequences are used (see Table 2 for details), one of which is CTTCCGATCTCGACAC.
5) UMI sequence
The UMI sequence consists of 10 random bases (each randomly chosen from A, T, C, and G) and serves as the basis for absolute quantification.
6) Sample capture sequence
The sample capture sequence comprises Poly-T, which binds specifically to the Poly-A tail at the 3' end of mRNA and thereby captures the mRNA. Poly-T is followed by two further bases, V and N, where V is a random base chosen from the three bases other than T and N is a random base chosen from A, T, C, and G; together with Poly-T, these two bases are intended to capture mRNA specifically.
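The element order listed in this example can be made concrete by concatenating the parts into one probe string. The sketch below is illustrative only: `build_probe` and its parameters are hypothetical names, the Poly-T length of 16 is an assumption, and the UMI, V, and N positions are drawn at random to show one concrete molecule (in synthesis they are degenerate positions).

```python
import random

P7_ADAPTER = "CAAGCAGAAGACGGCATACGAGAT"        # sequence 1
READ2 = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"   # sequence 2


def build_probe(sample_tag, pit_code, spacer_len=5, umi_len=10,
                poly_t_len=16, rng=None):
    """Concatenate the probe elements in 5'-to-3' order.

    poly_t_len=16 is an assumed value; rng makes the random UMI, V and N
    positions reproducible.
    """
    rng = rng or random.Random()
    umi = "".join(rng.choice("ACGT") for _ in range(umi_len))
    v = rng.choice("ACG")    # V: any base except T
    n = rng.choice("ACGT")   # N: any base
    return ("T" * spacer_len + P7_ADAPTER + sample_tag + READ2
            + pit_code + umi + "T" * poly_t_len + v + n)


probe = build_probe("CGTGAT", "CTTCCGATCTCGACAC", rng=random.Random(0))
print(len(probe), probe)
```

Because the sample tag and the micro-pit code sit at fixed offsets on either side of Read2, each element can later be recovered by position.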
Second, design concept of nucleic acid coding probe
Overall, this coding probe design fully takes the basic structural requirements of a sequencing library into account. For example, a P7 adapter sequence is required at the 5' end, the core functional sequences are generally placed after the sequencing primer site (Read2), and the positions of the micro-pit code, UMI, and Poly-T make it easy to identify each part of the sequence.
Locally, the most important part of the coding probe is its coding region, which adopts multi-level coding: the primary code (sample tag) and the secondary code (micro-pit code) are cross-combined. This greatly increases the coding capacity on the one hand and effectively improves decoding efficiency and accuracy on the other, because with multi-level coding the capacity of the coding library equals the product of the coding capacities of the individual levels. For example, if the primary coding capacity is M and the secondary coding capacity is N, cross-combination yields a coding probe library with a capacity of M × N. The benefits of this design fall into the following three aspects:
First, the coding capacity is greatly increased, so coding probe sequences with better performance can be selected from the coding library. Not every probe in the coding library is suitable for use as a code. During sequencing library construction and on-instrument sequencing, coding probes undergo a series of molecular reactions, such as PCR, reverse transcription (RT), and ligation. In these reactions, a GC content that is too high or too low, excessive repeated sequence, or self-complementary structure in a probe can affect the reaction unpredictably. It is therefore necessary to select suitable probe sequences, and a larger coding capacity gives more freedom of choice. In addition, the lengthy library construction and sequencing steps accumulate sequence errors, such as base mismatches and adapter mismatches, which may cause codes to be confused with one another. This can be countered by increasing the Hamming distance between coding sequences (the number of positions at which two coding sequences differ) and selecting coding probes with larger Hamming distances to form the final coding library.
Second, the coding cost is greatly reduced, because a cost proportional to (M × N) is replaced by a cost proportional to (M + N). As described above, coding probes have long sequences, many types, and complex terminal modifications, so cost rises steeply as the number of probe types increases.
FIG. 2 shows a schematic diagram of nucleic acid coding probe synthesis using T4 DNA ligase: the cost can be reduced markedly by splitting a complete long sequence into two short sequences, synthesizing them separately, and then joining them with ligase.
For example, suppose the unit cost of synthesizing the 114-bp full-length coding probe shown in FIG. 1 is 1000 yuan, so that synthesizing 400 coding probes costs 400,000 yuan. With the split synthesis shown in FIG. 2, if the unit cost of the 55-bp P1 is 200 yuan and the unit cost of the 59-bp P2 is 100 yuan, 400 coding probes can be obtained by combining 10 P1 and 40 P2 sequences, at a final cost of 10 × 200 + 40 × 100 = 6000 yuan, roughly 67 times less. The cost reduction becomes even more pronounced as the coding capacity increases further.
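The arithmetic in this example can be reproduced directly. The sketch below uses only the unit costs and counts quoted in the text; the function names are illustrative.

```python
def full_synthesis_cost(n_probes: int, unit_cost: int) -> int:
    """Cost when every full-length probe is synthesized on its own (M x N)."""
    return n_probes * unit_cost


def split_synthesis_cost(m: int, p1_cost: int, n: int, p2_cost: int) -> int:
    """Cost when m P1 and n P2 oligonucleotides are synthesized and ligated (M + N)."""
    return m * p1_cost + n * p2_cost


full = full_synthesis_cost(400, 1000)            # 400 probes at 1000 yuan each
split = split_synthesis_cost(10, 200, 40, 100)   # 10 P1 at 200 yuan, 40 P2 at 100 yuan

print(full, split, round(full / split, 1))       # 400000 6000 66.7
```

With these figures the split route is about 67 times cheaper, and the gap widens as M and N grow because the full route scales with their product.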
Third, decoding is convenient, fast, and accurate. During decoding, only the micro-pit code (secondary code) has to be obtained by demultiplexing, while the sample tag (primary code) is read out with a dedicated primer during sequencing. Multi-level coding therefore shortens decoding time and also improves decoding accuracy. This is because the primary coding capacity is generally smaller than the secondary coding capacity; in other words, the sample tag sequence is shorter than the micro-pit code. The longer a sequence is, the more errors accumulate during sequencing and the more recognition errors arise during decoding. Multi-level coding thus improves decoding efficiency and accuracy and allows accurate sequencing data to be obtained more efficiently.
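One way to see why a larger Hamming distance makes decoding more robust is nearest-code matching: if every pair of codes differs at 3 or more positions, a read with a single base error is still within one mismatch of exactly one code. The following is a minimal sketch of that idea, not the decoding software used by the inventors. The function names are illustrative, and the four sample tags from this document stand in for a code whitelist.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def decode(read_code: str, whitelist, max_mismatch: int = 1):
    """Return the unique whitelist code within max_mismatch of the read, else None."""
    hits = [c for c in whitelist
            if len(c) == len(read_code) and hamming(read_code, c) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None


# Stand-in whitelist: the sample tags given in this document.
whitelist = ["CGTGAT", "ACATCG", "GCCTAA", "TGGTCA"]

print(decode("CGTGAT", whitelist))   # exact match
print(decode("CGTGAA", whitelist))   # one sequencing error, still assigned
print(decode("AAAAAA", whitelist))   # too far from every code
```

Reads that match no code, or more than one, are returned as `None` so that they can be discarded instead of being assigned to the wrong sample.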
Example 2 establishment of a method for synthesizing nucleic acid-encoding probes
Once the structural design and the split-synthesis scheme of the coding probe have been determined, the specific synthesis method and conditions must be explored and optimized. FIG. 2 illustrates nucleic acid coding probe synthesis using T4 DNA ligase: the complete long probe sequence is split into two short sequences (P1 and P2) that are synthesized separately, and the two parts are joined into a complete capture probe by T4 DNA ligase.
For the following nucleic acid coding probe (which contains no UMI sequence, because Sanger sequencing requires a defined sequence and the validation experiment does not need a UMI):
TTTTTTTCAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACG (P1) TGTGCTCTTCCGATCTCGACACGGTTTGGGCCNNNNNNNNNNTTTTTTTTTTTTTTTTVN (P2)
two schemes were first established by theoretical design and experimental verification, as shown in FIG. 3.
1. First scheme
In the first scheme, P2 is kept in excess over P1; after P1 and P2 are ligated with T4 ligase, the product must be purified with a purification kit. This is because the qPCR results show a significant difference in Ct values between unpurified and purified coding probes.
The specific method comprises the following steps:
1) The nucleic acid coding probe to be synthesized is split within Read2 into two segments, Read2-1 and Read2-2. The sequence consisting of the 5'-end biotin modification, the P7 sequencing adapter, the sample tag, and Read2-1 is named P1, and the sequence consisting of Read2-2, the micro-pit code, the UMI, and Poly-T is named P2.
The linker is reverse complementary to the part of Read2 that spans the split site; its sequence is GAAGAGCACACGTCTGAAC.
P1, P2, and the linker are synthesized separately; the base at the 5' end of P2 is phosphorylated, and the 5' end of P1 is biotin-labeled;
P1: TTTTTTT (spacer sequence) CAAGCAGAAGACGGCATACGAGAT (sequencing adapter) CGTGAT (sample tag) GTGACTGGAGTTCAGACG (Read2-1);
P2: TGTGCTCTTCCGATCT (Read2-2) CGACACGGTTTGGGCC (micro-pit code) NNNNNNNNNN (UMI) TTTTTTTTTTTTTTTT (Poly-T) VN (V is a random base chosen from the three bases other than T; N is a random base chosen from A, T, C, and G)
2) Preparation of nucleic acid-encoded probes by ligating P1 and P2
100uL ligation system: mix 3uL of P1 at 10uM, 1uL of P2 at 10uM, 3uL of linker at 10uM, 5uL of T4 ligase (NEB, M0202L) at 100U/uL and 10uL of ligase Buffer, and make up the balance with nuclease-free water; the volume ratio of P1 to P2 is 3:1.
Incubate the ligation system overnight at 16 ℃, collect the ligation product, purify it with the QIAquick PCR Purification Kit (Qiagen), and collect the purified product, which is the nucleic acid coding probe.
3) qPCR
The ligation product before purification and the purified product from 2) above were used, respectively, as coding probes to capture mRNA obtained by lysing Jurkat cells (lysis buffer: 200mM Tris-HCl, 20mM EDTA, 1% Sarkosyl, 50mM DTT), and qPCR amplification was performed using the P7/TSO primers.
Forward 5’-CAAGCAGAAGACGGCATACGAG-3’
Reverse 5’-AAGCAGTGGTATCAACGCAGAGT-3’
As shown in the left panel of FIG. 3, the CT value of the unpurified coded probe is higher than that of the purified coded probe, demonstrating that purification with a purification kit is required when P2 is in relative excess.
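The split of Read2 and the role of the linker described in step 1) can be checked with a short script. The following is a minimal sketch, not part of the original method: the sequences are copied from the text, while the helper names `reverse_complement` and `splint_overlaps` are illustrative assumptions. It confirms that the linker is the reverse complement of a stretch of Read2 that spans the P1/P2 junction, and reports how many linker bases pair with each side.

```python
# Sequences copied from the text; the helper names are illustrative only.
READ2_1 = "GTGACTGGAGTTCAGACG"   # 3' end of P1
READ2_2 = "TGTGCTCTTCCGATCT"     # 5' end of P2
LINKER = "GAAGAGCACACGTCTGAAC"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def splint_overlaps(left: str, right: str, linker: str) -> tuple[int, int]:
    """Return how many linker bases pair with `left` and with `right`.

    Raises ValueError if the linker does not anneal across the junction.
    """
    target = reverse_complement(linker)
    joined = left + right
    start = joined.find(target)
    end = start + len(target)
    if start < 0 or not (start < len(left) < end):
        raise ValueError("linker does not span the P1/P2 junction")
    return len(left) - start, end - len(left)


read2 = READ2_1 + READ2_2
on_p1, on_p2 = splint_overlaps(READ2_1, READ2_2, LINKER)
print(len(read2), on_p1, on_p2)
```

With the sequences given above, the linker pairs with the last 9 bases of Read2-1 and the first 10 bases of Read2-2, so it holds the 3' end of P1 next to the phosphorylated 5' end of P2 for ligation.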
2. Scheme two
Scheme two keeps P1 in excess of P2; after P1 and P2 are ligated with T4 ligase, the product can be stored for later use without purification. This is because the qPCR results show no significant difference in CT value between the unpurified and the purified coded probes, demonstrating that purification has no clear effect when P1 is in relative excess.
The specific method comprises the following steps:
1) The nucleic acid coding probe to be synthesized is split within Read2 into 2 sections, Read2-1 and Read2-2. The sequence consisting of the 5'-end biotin modification, the P7 sequencing linker, the sample tag and Read2-1 is named P1, and the sequence consisting of Read2-2, the micro-pit code, the UMI and Poly-T is named P2.
The linker is reverse complementary to Read2 and has the sequence GAAGAGCACACGTCTGAAC.
P1: TTTTTTT (space sequence) CAAGCAGAAGACGGCATACGAGAT (sequencing linker) CGTGAT (sample tag) GTGACTGGAGTTCAGACG (Read2-1);
P2: TGTGCTCTTCCGATCT (Read2-2) CGACACGGTTTGGGCC (micro-pit code) NNNNNNNNNN (UMI) TTTTTTTTTTTTTTTT (Poly-T) VN (V is any one of the 3 bases other than T; N is any one of A, T, C and G)
2) Preparation of nucleic acid-encoded probes by ligating P1 and P2
100uL ligation system: mix 3uL of P1 at 10uM, 1uL of P2 at 10uM, 3uL of linker at 10uM, 5uL of T4 ligase at 100U/uL and 10uL of ligase Buffer, and make up the balance with nuclease-free water; the volume ratio of P1 to P2 is 3:1;
incubate the ligation system overnight at 16 ℃, collect the ligation product, purify it with the QIAquick PCR Purification Kit (Qiagen), and collect the purified product, which is the nucleic acid coding probe.
3) qPCR
The ligation product before purification and the purified product from 2) above were used, respectively, as coding probes to capture mRNA obtained by lysing Jurkat cells, and qPCR amplification was performed using the P7/TSO primers.
Forward 5’-CAAGCAGAAGACGGCATACGAG-3’
Reverse 5’-AAGCAGTGGTATCAACGCAGAGT-3’
The results are shown in the right panel of fig. 3: there is no significant difference in CT value between the purified and the unpurified coded probes, demonstrating that, when P1 is in relative excess, purification has no clear effect for the time being.
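The composition of the 100uL ligation system in step 2) of scheme two can be tabulated with a few lines of arithmetic. This is a minimal sketch using only the volumes and stock concentrations listed above; the variable names are illustrative assumptions. Because all three oligonucleotide stocks are at 10uM, the 3:1 volume ratio of P1 to P2 is also a 3:1 molar ratio.

```python
# Volumes (uL) and stock concentrations (uM) as listed in step 2).
TOTAL_UL = 100.0
OLIGOS = {"P1": (3.0, 10.0), "P2": (1.0, 10.0), "linker": (3.0, 10.0)}
LIGASE_UL, LIGASE_U_PER_UL = 5.0, 100.0
BUFFER_UL = 10.0


def pmol(volume_ul: float, conc_um: float) -> float:
    """uL x uM (umol/L) gives an amount in pmol."""
    return volume_ul * conc_um


# Amount of each oligonucleotide in the reaction, in pmol.
amounts = {name: pmol(v, c) for name, (v, c) in OLIGOS.items()}
# Final concentration in the 100 uL system (pmol / uL equals uM).
final_um = {name: amt / TOTAL_UL for name, amt in amounts.items()}
# Nuclease-free water needed to reach the total volume.
water_ul = TOTAL_UL - sum(v for v, _ in OLIGOS.values()) - LIGASE_UL - BUFFER_UL
ligase_units = LIGASE_UL * LIGASE_U_PER_UL
p1_to_p2 = amounts["P1"] / amounts["P2"]

print(amounts, final_um, water_ul, ligase_units, p1_to_p2)
```

Under these figures the system contains 30 pmol of P1, 10 pmol of P2 and 30 pmol of linker, with 78uL of water making up the balance.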
Example 3. Design of nucleic acid coding probes and experimental validation of the synthesis scheme
Design of nucleic acid coding probes
According to the scheme of part one of Example 1, nucleic acid coding probes were designed for synthesis:
each nucleic acid coding probe is obtained by joining a P1 in Table 1 to a P2 in Table 2, with the last base of P1 immediately adjacent to the first base of P2.
For example: P1-A in Table 1 is joined to each of P2-1 to P2-96 in Table 2 to form 96 nucleic acid coding probes;
P1-B in Table 1 is joined to each of P2-1 to P2-96 in Table 2 to form 96 nucleic acid coding probes;
P1-C in Table 1 is joined to each of P2-1 to P2-96 in Table 2 to form 96 nucleic acid coding probes;
P1-D in Table 1 is joined to each of P2-1 to P2-96 in Table 2 to form 96 nucleic acid coding probes;
in total, this example of the invention synthesizes 384 nucleic acid coding probes.
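The pairing of every P1 with every P2 is a Cartesian product, which is where the figure of 384 comes from. The following is a minimal sketch that enumerates the pairings by name only; the actual sequences are in Tables 1 and 2 and are not reproduced here, and `probe_names` is an illustrative helper, not something defined in the source.

```python
from itertools import product

# Names only; the sequences themselves are listed in Tables 1 and 2.
P1_NAMES = [f"P1-{letter}" for letter in "ABCD"]
P2_NAMES = [f"P2-{i}" for i in range(1, 97)]


def probe_names(p1_names, p2_names):
    """Enumerate every P1 x P2 pairing as (P1, P2) name tuples."""
    return list(product(p1_names, p2_names))


probes = probe_names(P1_NAMES, P2_NAMES)
print(len(probes))
```

Each tuple identifies one probe by its sample tag (P1) and its micro-pit code (P2), so 4 P1 oligonucleotides and 96 P2 oligonucleotides are enough to address 4 x 96 = 384 distinct codes.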
Table 1 shows the P1 and linker sequences of nucleic acid encoding probes
Figure BDA0003261191130000101
In Table 1 above, the first 7 T from the 5' end of P1-A to P1-D are the space sequence, and the first T carries the biotin modification; the 6 underlined bases are the sample tag, the sequencing linker lies between the space sequence and the sample tag, and Read2-1 follows the sample tag.
Table 2 shows the P2 sequences of the nucleic acid encoding probes
Figure BDA0003261191130000102
Figure BDA0003261191130000111
Figure BDA0003261191130000121
Figure BDA0003261191130000131
Figure BDA0003261191130000141
In Table 2 above, the first 16 bases from the 5' end of P2-1 to P2-96 correspond to Read2-2; the 10 N are the UMI sequence; the 16 bases immediately before the UMI sequence are the micro-pit coding sequence; the 16 T immediately after the UMI sequence are Poly-T; in the final VN, V is any one of the 3 bases other than T, and N is any one of A, T, C and G.
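The layout of P2 described above is positional, so a P2 sequence can be cut into its segments by length alone. The following is a minimal sketch under that assumption; `P2Parts` and `split_p2` are illustrative names, and the example sequence is the P2 given in Example 2.

```python
from typing import NamedTuple


class P2Parts(NamedTuple):
    read2_2: str
    pit_code: str
    umi: str
    poly_t: str
    vn: str


# Segment lengths as described for Table 2: 16 + 16 + 10 + 16 + 2 = 60 nt.
SEGMENT_LENGTHS = (16, 16, 10, 16, 2)


def split_p2(p2: str) -> P2Parts:
    """Cut a P2 sequence into Read2-2, micro-pit code, UMI, Poly-T and VN."""
    if len(p2) != sum(SEGMENT_LENGTHS):
        raise ValueError(f"expected {sum(SEGMENT_LENGTHS)} nt, got {len(p2)}")
    parts, pos = [], 0
    for length in SEGMENT_LENGTHS:
        parts.append(p2[pos:pos + length])
        pos += length
    return P2Parts(*parts)


# P2 from Example 2, written segment by segment for readability.
EXAMPLE_P2 = (
    "TGTGCTCTTCCGATCT"   # Read2-2
    "CGACACGGTTTGGGCC"   # micro-pit code
    "NNNNNNNNNN"         # UMI
    "TTTTTTTTTTTTTTTT"   # Poly-T
    "VN"
)
parts = split_p2(EXAMPLE_P2)
print(parts)
```

The same slicing applies to every row of Table 2, since only the 16-base micro-pit code differs between P2-1 and P2-96.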
Synthesis of nucleic acid encoding probes
A series of experiments was used to determine the feasibility of the coded probe design and synthesis method, including qPCR, second-generation sequencing, flow cytometry, gel electrophoresis and first-generation sequencing.
qPCR: as described above, the qPCR experiments yielded two provisionally feasible synthesis schemes; further experiments were needed to decide which of the two to use.
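A CT difference between two qPCR reactions can be converted into an approximate fold difference in starting template. The following is a minimal sketch of that standard conversion, assuming both reactions amplify with the same efficiency; the function name and the CT values are hypothetical and are not taken from FIG. 3.

```python
def fold_difference(ct_a: float, ct_b: float, efficiency: float = 1.0) -> float:
    """Template abundance of sample B relative to sample A.

    Assumes both reactions share the same amplification efficiency
    (1.0 means perfect doubling per cycle). A higher CT for A than for B
    means A started with less template.
    """
    return (1.0 + efficiency) ** (ct_a - ct_b)


# Hypothetical CT values for illustration only; they are not data from FIG. 3.
unpurified_ct, purified_ct = 24.0, 22.0
print(fold_difference(unpurified_ct, purified_ct))
```

Read this way, a higher CT for the unpurified probe in scheme one corresponds to less amplifiable captured product, while equal CT values in scheme two correspond to a fold difference of about 1.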
1. Second generation sequencing
1) Synthesis of nucleic acid encoding probes
192 coded probes (taking as an example the probes formed from P1-A and P1-B with each P2) were synthesized according to scheme one and scheme two of Example 2, respectively, and used in sequencing experiments on A549 cells (ATCC, CCL-185). The most important indicator here is the uniformity of the 192 probes, i.e. whether the synthesis scheme ensures that all probes are synthesized in equal and uniform amounts.
2) Hybridization of probes
A549 cells were lysed with the above lysis buffer to obtain a cell lysate.
0.2uL of the nucleic acid encoding probe synthesized above (0.1umol/l) was dispensed into the corresponding micro-pits of the superhydrophobic chip (Wang, Y., Wu, Y., Chen, Y., Zhang, J., Chen, X. and Liu, P. (2018) Nanoliter Centrifugal Liquid Dispenser Coupled with Superhydrophobic Microwell Array Chips for High-Throughput Cell Assays. Micromachines (Basel), 9.) and, after coupling to an excess (40uL, 5mg/ml) of M270 magnetic beads (Thermo Fisher), a total volume of 40uL of cell lysate was added.
The reaction was carried out at room temperature for 20 minutes.
After the probes had sufficiently captured the mRNA released by the cells, the magnetic beads were recovered by pipette washing, and the liquid containing the magnetic beads was collected.
The capture performance of the nucleic acid-encoded probes was evaluated by qPCR amplification, using the liquid containing the magnetic beads as template and primers for MTND4L, ACTB, PPP1R26 and GPR17; the primer information is shown in Table 3 below.
Table 3 shows the primer sequences
Figure BDA0003261191130000142
Figure BDA0003261191130000151
The qPCR amplification product was sent for Illumina sequencing.
The results are as follows:
the spatial distribution of the sequencing reads of the coded probes synthesized by scheme one is shown in fig. 4; the probes synthesized by scheme one show poor uniformity, with no clear correspondence to spatial position on the 192-well plate.
The spatial distribution of the sequencing reads of the coded probes synthesized by scheme two is shown in fig. 5; the probes synthesized by scheme two show better uniformity.
Fig. 6 and 7 show, respectively, the sequencing read statistics for the 192 probes under scheme two and the alignment ratio of the sequencing data for each probe under scheme two; the number of detected genes and the gene alignment ratio both indicate that probe uniformity is better under scheme two.
From the above data, scheme two was identified as the better synthesis scheme, i.e. P1 is kept in excess relative to P2 and no purification step is introduced.
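The uniformity criterion used in this comparison can be expressed as a single number per scheme. The following is a minimal sketch that uses the coefficient of variation of per-probe read counts; this metric is an assumption chosen for illustration rather than the analysis used in the source, and the counts are made up and are not data from FIG. 4 to FIG. 7.

```python
from statistics import mean, pstdev


def coefficient_of_variation(read_counts):
    """Population standard deviation divided by the mean; 0 means perfectly uniform."""
    counts = list(read_counts)
    avg = mean(counts)
    if avg == 0:
        raise ValueError("no reads assigned to any probe")
    return pstdev(counts) / avg


# Made-up counts for illustration only; not data from FIG. 4 to FIG. 7.
even = [100, 102, 98, 100]
uneven = [10, 250, 40, 100]
print(coefficient_of_variation(even), coefficient_of_variation(uneven))
```

A lower value indicates that reads are spread more evenly across the probes, which is the property attributed to scheme two above.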
2. Flow cytometry
The coded probes synthesized using protocol two were compared to the synthetic full-length probes (ordered complete nucleic acid-encoding probes) using flow cytometry.
The following probes are taken as examples:
TTTTTTTCAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACG(P1)TGTGCTCTTCCGATCTCGACACGGTTTGGGCCNNNNNNNNNNTTTTTTTTTTTTTTTTVN(P2)
A549 cells were lysed with the above lysis buffer to obtain a cell lysate.
The assay was divided into 4 groups:
Bead plus effective T4 ligated coding probe group (lower right panel of FIG. 8): the nucleic acid coding probes synthesized by scheme two above were dispensed into the corresponding micro-pits of the superhydrophobic chip (Wang, Y., Wu, Y., Chen, Y., Zhang, J., Chen, X. and Liu, P. (2018) Nanoliter Centrifugal Liquid Dispenser Coupled with Superhydrophobic Microwell Array Chips for High-Throughput Cell Assays. Micromachines (Basel), 9.), coupled to an excess (0.2 × 5mg) of M270 magnetic beads (Thermo Fisher), and then mixed with a total of 40uL of cell lysate; the reaction was carried out for 20 minutes at room temperature; after the probes had sufficiently captured the mRNA released by the cells, the magnetic beads were recovered by pipette washing, and the liquid containing the magnetic beads was collected.
Bead plus ineffective T4 ligated coding probe group (lower left panel of FIG. 8): the only difference from the bead plus effective T4 ligated coding probe group is that the nucleic acid coding probe synthesized by scheme two is replaced with the coding probe obtained by ineffective T4 ligation;
the coding probe for the ineffective T4 ligation group is prepared essentially as in scheme two, except that the 5' end of P2 carries no phosphorylation modification.
Bead-only group (negative control; upper left panel of FIG. 8): the only difference from the bead plus effective T4 ligated coding probe group is that no nucleic acid coding probe is added;
Ordered complete capture probe group (upper right panel of FIG. 8): the only difference from the bead plus effective T4 ligated coding probe group is that the synthesized nucleic acid coding probe is replaced with an ordered, artificially synthesized full-length probe;
in all groups, only the 3' end of P2 is labeled with a fluorescent group, which serves as the basis for judging whether P1 and P2 have been ligated (only P1 can bind to the magnetic beads).
As shown in FIG. 8, the four panels show beads only (upper left), beads plus ordered capture probes (upper right), beads plus ineffective T4 ligated coding probes (lower left), and beads plus effective T4 ligated coding probes (lower right). The probes synthesized by scheme two have a peak pattern almost identical to that of the ordered full-length probes, with only slightly lower intensity. This indicates that scheme two yields full-length probes, with a small admixture of short fragments.
3. Gel electrophoresis
The coded probe synthesized by scheme two was amplified by PCR, purified, and then examined by gel electrophoresis.
Full-length coded probes were synthesized by the method of scheme two above, using the following P1 and P2 sequences: TTTTTTTCAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACG and TGTGCTCTTCCGATCTCGACACGGTTTGGGCCTTTTTTTTTTTTTTTTVN (containing no UMI sequence, since Sanger sequencing requires a defined sequence and the validation experiment does not need a UMI).
The nucleotide sequence of the probe synthesized by scheme two is as follows: TTTTTTTCAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACG(P1)TGTGCTCTTCCGATCTCGACACGGTTTGGGCCTTTTTTTTTTTTTTTTVN(P2)
The above nucleic acid probe was constructed into a plasmid as template and amplified with primers against the sequences at both ends (the primer sequences are as above, P7 and TSO); the amplified product was purified and subjected to gel electrophoresis.
The results are shown in FIG. 9, where the 9 lanes represent 9 replicates obtained after plasmid amplification; with the marker and NC behaving normally, all 9 lanes show a distinct fragment of about 110bp, verifying that scheme two synthesizes the pre-designed full-length probe.
4. First generation sequencing
The gel portions containing the nucleic acid fragments in lanes 5, 6 and 7 of the gel electrophoresis in section 3 above were excised and used for Sanger sequencing. The sequence obtained by Sanger sequencing was identical to that of the pre-designed full-length probe.
Thus, the above results indicate that the nucleic acid encoding probe was successfully synthesized by scheme two.
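Comparing a sequencing read with the designed probe has to allow for the degenerate bases V and N at the 3' end. The following is a minimal sketch of such a comparison, assuming the UMI-free design used in the gel electrophoresis and Sanger experiments; the helper names and the example reads are hypothetical, and only the IUPAC codes that occur in the design are handled.

```python
import re

# IUPAC codes used in the design: V is A, C or G (not T); N is any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "[ACG]", "N": "[ACGT]"}

# P1 and UMI-free P2, assembled from the segments given in the text.
P1 = "TTTTTTT" + "CAAGCAGAAGACGGCATACGAGAT" + "CGTGAT" + "GTGACTGGAGTTCAGACG"
P2_NO_UMI = "TGTGCTCTTCCGATCT" + "CGACACGGTTTGGGCC" + "T" * 16 + "VN"


def design_pattern(template: str) -> re.Pattern:
    """Translate a template with IUPAC codes into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in template))


def matches_design(read: str, template: str) -> bool:
    """True if the read agrees with the template at every position."""
    return design_pattern(template).fullmatch(read) is not None


full_length = P1 + P2_NO_UMI
# Hypothetical reads for illustration; they are not the Sanger results.
good_read = full_length[:-2] + "AG"
bad_read = full_length[:-2] + "TG"   # V may not be T
print(len(full_length), matches_design(good_read, full_length),
      matches_design(bad_read, full_length))
```

The designed UMI-free probe is 105 nt long (55 nt of P1 plus 50 nt of P2), which is of the same order as the band of about 110bp reported for the gel.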
SEQUENCE LISTING
<110> Tsinghua University
<120> design and synthesis method of nucleic acid coding probe for high-throughput sequencing
<160> 2
<170> PatentIn version 3.5
<210> 1
<211> 24
<212> DNA
<213> Artificial sequence
<400> 1
caagcagaag acggcatacg agat 24
<210> 2
<211> 34
<212> DNA
<213> Artificial sequence
<400> 2
gtgactggag ttcagacgtg tgctcttccg atct 34

Claims (10)

1. The nucleic acid coding probe group consists of m multiplied by n nucleic acid coding probes; m and n are integers greater than or equal to 2;
each nucleic acid encoding probe comprises a sequencing joint, a sample label sequence, a Read2 sequence, a micro pit coding sequence, a UMI sequence and a sample capture sequence; the sample label sequence and the micro-pit coding sequence form a sample coding system of the nucleic acid coding probe;
the sample coding system of each nucleic acid coding probe is different;
the nucleic acid coding probe group contains m sample tag sequences and n micro-pit coding sequences, so that the number of sample coding systems in the nucleic acid coding probe group is m multiplied by n; m is less than n;
the micro-pit coding sequence in each nucleic acid coding probe is longer than the sample tag sequence;
any two of the n micro-pit coding sequences differ at more than 2 bases.
2. The nucleic acid encoding probe set of claim 1, wherein:
each sample tag sequence consists of 6 random bases and contains no run of more than 2 identical consecutive bases;
each micro-pit coding sequence consists of 10-20 random bases, different micro-pit coding sequences differ at more than 2 bases, and each micro-pit coding sequence contains no run of more than 2 identical consecutive bases;
the UMI sequence consists of 10 random bases;
the sample capture sequence consists of Poly-T, V and N; wherein said V is any one of the 3 bases other than T, and said N is a random base;
the random base is A, T, C or G.
3. The nucleic acid encoding probe set of claim 1 or 2, wherein:
the upstream of the sequencing joint is also connected with a space sequence, the space sequence consists of 5-10T, and the first base of the space sequence is modified by biotin.
4. The nucleic acid encoding probe of any one of claims 1-3, wherein:
the nucleic acid coding probe sequentially consists of the following components from the 5' end: the space sequence, the sequencing linker, the sample tag sequence, the Read2 sequence, the pit coding sequence, the UMI sequence, the Poly-T, the V, and the N.
5. The nucleic acid encoding probe of any one of claims 1-4, wherein:
the nucleotide sequence of the sequencing joint is a sequence 1 in a sequence table;
the sample tag sequence is CGTGAT, ACATCG, GCCTAA or TGGTCA;
the nucleotide sequence of the Read2 sequence is the sequence 2 in the sequence table.
6. A method of synthesizing a nucleic acid coded probe according to any one of claims 1 to 5, comprising the steps of:
1) designing the nucleic acid encoding probes in the nucleic acid encoding probe set of any one of claims 1-5, and splitting each of the nucleic acid encoding probes between any 2 adjacent bases in Read2, wherein the sequence at the 5' end of the nucleic acid encoding probe is named P1 and the remaining sequence is named P2;
2) respectively synthesizing P1, corresponding P2 and linker of each nucleic acid encoding probe;
the 5' end of the P2 is modified by phosphorylation;
the linker is reverse complementary to Read2 in the nucleic acid encoding probe;
3) and connecting the P1 of each nucleic acid coding probe, the corresponding P2 of each nucleic acid coding probe and the linker under the action of T4 ligase to obtain a connection product, namely the nucleic acid coding probe.
7. The method of claim 6, wherein: in the step 3), the molar amount of the P1 in the connection system is larger than that of the P2;
or, in the step 3), the molar amount of the P1 in the connection system is smaller than that of the P2.
8. The method according to claim 6 or 7, characterized in that: in step 3), if the molar amount of P1 is less than the molar amount of P2, the method further comprises the steps of: purifying the ligation product.
9. Use of a probe according to any one of claims 1 to 5 or prepared by a method according to any one of claims 6 to 8 for capturing a fragment of interest;
or, use of the probe of any one of claims 1 to 5 or prepared by the method of any one of claims 6 to 8 in high throughput sequencing;
or, the use of a probe according to any one of claims 1 to 5 or prepared by a method according to any one of claims 6 to 8 for the preparation of a product for capturing a fragment of interest;
or, the use of the probe according to any one of claims 1 to 5 or prepared by the method according to any one of claims 6 to 8 for the preparation of a high throughput sequencing product.
10. A system for synthesizing the probe of any one of claims 1 to 5, comprising the P1, the P2, the linker, and the T4 ligase of any one of claims 6 to 8.
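The sequence constraints recited in claims 1 and 2 (runs of at most 2 identical consecutive bases, and codes that differ from one another at more than 2 positions) can be checked mechanically. The following is a minimal sketch that applies both checks to the four sample tags listed in claim 5; applying the pairwise-difference check to sample tags is an illustrative choice, since the claims state it for micro-pit coding sequences, and all function names are assumptions.

```python
from itertools import combinations, groupby

SAMPLE_TAGS = ["CGTGAT", "ACATCG", "GCCTAA", "TGGTCA"]  # from claim 5


def longest_run(seq: str) -> int:
    """Length of the longest stretch of identical consecutive bases."""
    return max(len(list(group)) for _, group in groupby(seq))


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(seqs) -> int:
    """Smallest Hamming distance over all pairs of sequences."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


runs = {tag: longest_run(tag) for tag in SAMPLE_TAGS}
print(runs, min_pairwise_distance(SAMPLE_TAGS))
```

For these four tags the longest run is 2 bases and every pair differs at 4 or more positions, so a single base-calling error cannot turn one tag into another. The same two functions can be applied to a list of micro-pit coding sequences.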
CN202111073314.0A 2021-09-14 2021-09-14 Design and synthesis method of nucleic acid coding probe for high-throughput sequencing Pending CN113736777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111073314.0A CN113736777A (en) 2021-09-14 2021-09-14 Design and synthesis method of nucleic acid coding probe for high-throughput sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111073314.0A CN113736777A (en) 2021-09-14 2021-09-14 Design and synthesis method of nucleic acid coding probe for high-throughput sequencing

Publications (1)

Publication Number Publication Date
CN113736777A true CN113736777A (en) 2021-12-03

Family

ID=78738612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111073314.0A Pending CN113736777A (en) 2021-09-14 2021-09-14 Design and synthesis method of nucleic acid coding probe for high-throughput sequencing

Country Status (1)

Country Link
CN (1) CN113736777A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020047005A1 (en) * 2018-08-28 2020-03-05 10X Genomics, Inc. Resolving spatial arrays
CN112805389A (en) * 2018-10-01 2021-05-14 贝克顿迪金森公司 Determination of 5' transcript sequences
CN113106150A (en) * 2021-05-12 2021-07-13 浙江大学 Ultrahigh-throughput single cell sequencing method
US20210254143A1 (en) * 2017-11-27 2021-08-19 The Trustees Of Columbia University In The City Of New York RNA Printing and Sequencing Devices, Methods, and Systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Paul Datlinger et al.: "Ultra-high-throughput single-cell RNA sequencing and perturbation screening with combinatorial fluidic indexing" *
Jiang Xinyi (蒋忻怡): "Establishment of a novel single-cell sequencing analysis platform, Microwell-seq" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination