WO2023035143A1 - A high-quality 3' RNA-seq library construction method and use thereof - Google Patents

A high-quality 3' RNA-seq library construction method and use thereof

Info

Publication number
WO2023035143A1
WO2023035143A1 PCT/CN2021/117183 CN2021117183W WO2023035143A1 WO 2023035143 A1 WO2023035143 A1 WO 2023035143A1 CN 2021117183 W CN2021117183 W CN 2021117183W WO 2023035143 A1 WO2023035143 A1 WO 2023035143A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
seq
rna
bases
library
Prior art date
Application number
PCT/CN2021/117183
Other languages
English (en)
French (fr)
Inventor
鲁非
王静
徐俊
杨晓寒
Original Assignee
中国科学院遗传与发育生物学研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院遗传与发育生物学研究所
Priority to PCT/CN2021/117183
Publication of WO2023035143A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6806: Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C: CHEMISTRY; METALLURGY
    • C40: COMBINATORIAL TECHNOLOGY
    • C40B: COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B50/00: Methods of creating libraries, e.g. combinatorial synthesis
    • C40B50/06: Biochemical methods, e.g. using enzymes or whole viable microorganisms

Definitions

  • the invention relates to a high-quality 3'RNA-seq library construction method and its application, which can be used for accurate detection of gene expression levels in large-scale high-throughput populations.
  • RNA sequencing is a key technology in modern biological research, which transforms the study of many species from a single genome level to a multidimensional omics level, thus effectively improving our understanding of biological genomics.
  • RNA-seq RNA sequencing
  • the whole genome sequencing of many crops has been completed, resulting in a large amount of genome data, for example, genetic variation maps of wheat, maize, rice, cassava, potato and soybean, etc.
  • high-quality pan-genomes have also been assembled for many important crops.
  • the large volume of genomic data has left a research gap that needs to be filled with large amounts of transcriptome data to help decode genome function. Therefore, efficient RNA-seq technology is becoming more and more important for biological research.
  • SiPAS V2 simplified poly(A)-anchored sequencing
  • Paired-End (PE) sequencing is to sequence both ends of a DNA template fragment and generate high-quality, comparable sequence data. Paired-end sequencing reads are divided into read1 (R1, connected to Illumina P5 sequencing adapter) and read2 (R2, connected to Illumina P7 sequencing adapter) according to the sequence of sequencing.
  • R1 connected to Illumina P5 sequencing adapter
  • R2 connected to Illumina P7 sequencing adapter
  • the Illumina high-throughput sequencing platform adopts the standard paired-end 150bp sequencing mode.
  • the Illumina sequencing platform requires the base synthesis reaction time of each molecular cluster to be consistent. Since the enzyme activity and other activities gradually decrease with the reaction, the base synthesis reaction in the molecular cluster will be inconsistent, so the base quality of the R1 terminal is higher than that of the R2 terminal.
  • the reported 3'RNA-seq methods use custom paired-end sequencing (read length: R1 < R2 < 150bp) to reduce the impact of continuously reading poly(T) bases on sequencing quality, where R1 (poly(T) end) sequences only the barcode, and R2 (non-poly(T) end) is sequenced to the full 150bp for sequence alignment analysis.
  • R1 poly(T) end
  • R2 non-poly(T) end
  • the analysis uses specific molecular recognition sequences (UMI, Unique Molecular Identifiers) for reads count.
  • UMI Unique Molecular Identifiers
  • the improved bulk RNA (normal RNA starting amount, such as 0.5 μg or more) library construction method retains UMI markers.
  • the presence of UMI sequences increases the length of primers and synthesis costs, and reduces the effective sequence length. The correction effect of the normal RNA input library has not been evaluated.
  • the present invention tested whether UMI is valuable for 3'RNA-seq of normal RNA input, and found that UMI is not necessary for 3'RNA-seq; after the UMI sequence is omitted from the reverse transcription primer, the synthesis cost of each primer can be reduced by about 150 yuan.
  • the present invention swaps the sequencing adapters during library construction, that is, connects the P5 adapter to the non-poly(T) end and the P7 adapter to the poly(T) end, so that the non-poly(T) end is sequenced first in the subsequent sequencing process and the poly(T) end is sequenced second, making the library more suitable for paired-end 150bp sequencing, which improves the simplicity of the library construction method and the utilization rate of data, and allows gene expression to be quantified more accurately.
  • the adapter swap is achieved by modifying the reverse transcription primer.
  • the sequence of the reverse transcription primer comprises the universal sequence P7 adapter-(barcode)(T)nVN; wherein the universal sequence P7 adapter is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97.
  • the universal sequence P7 adapter is most preferably the sequence shown in SEQ ID NO: 97, that is, 22 bases. A partial sequence of the adapter (i.e., with several bases deleted) can also complete the reverse transcription reaction, but shortening the universal adapter sequence reduces the number of complementary base pairs formed when the PCR primer anneals to the reverse transcription product during PCR amplification, thereby reducing PCR efficiency; below 18 bases, the primer has difficulty pairing with the universal adapter sequence during annealing and library amplification cannot be completed. Therefore, in the present invention, the universal sequence P7 adapter may be an 18-22 base portion of the sequence shown in SEQ ID NO: 97, which can also support the reverse transcription reaction.
  • n in the reverse transcription primer is any integer from 12 to 35, preferably 21.
  • when the poly(T) length is 12-35 bases, reverse transcription can be performed.
  • the minimum T-base length of commonly used reverse transcription primers is 12bp; when the poly(T) is short, mis-primed reverse transcription readily occurs at internal homopolymer runs within the mRNA, and increasing the poly(T) length can effectively reduce such internal priming.
  • when the poly(T) length exceeds 35bp, the primer binding rate and reverse transcription efficiency decrease during reverse transcription.
  • n is any integer from 12 to 35, i.e. reverse transcription can be performed with any poly(T) length of 12-35 bases, and n is most preferably 21.
  • the universal sequence P5 adapter in the second-strand synthesis primer is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98.
  • the universal sequence P5 adapter is most preferably the sequence shown in SEQ ID NO: 98, that is, 26 bases, and a partial sequence of the adapter (i.e., with several bases deleted) can also support second-strand synthesis.
  • the universal sequence P5 adapter may be a 20-26 base portion of the sequence shown in SEQ ID NO: 98, which also supports second-strand synthesis.
  • n in the second-strand synthesis primer is any integer from 4 to 10, preferably 6-9: when the number of degenerate bases N is below 4, pairing between the primer and the template cDNA is unstable; when it is above 10, the annealing efficiency of primer and template decreases and the cost of primer synthesis increases. In the present invention, 4-10 degenerate N bases therefore achieve the effect, and 6-9 are preferred.
  • the present invention provides the following technical solutions:
  • a method for constructing a 3' RNA-seq library characterized in that, when the library is constructed, the sequencing joints are reversed, specifically connecting the P5 joint with the non-poly(T) end, and connecting the P7 joint with the poly(T) end .
  • the universal sequence P7 linker is the sequence shown in SEQ ID NO: 97, or the sequence obtained by deleting any 1 or any 2-4 consecutive bases in the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); said V is any one of bases A, G, and C; N is any one of bases A, T, C, and G;
  • the barcode sequence is a nucleotide sequence with a length of 4-12 bases, preferably, the barcode sequence is selected from SEQ ID NO: 1-96 any of the .
  • the universal sequence P5 linker is the sequence shown in SEQ ID NO: 98, or the sequence obtained by deleting any 1 or any 2-6 consecutive bases in the sequence shown in SEQ ID NO: 98; Said N is any one of bases A, T, C and G, and n is any integer of 4-10 (preferably 6-9).
  • a reverse transcription primer the sequence of which reverse transcription primer comprises the universal sequence P7 linker-(barcode)(T) n VN;
  • the universal sequence P7 linker is the sequence shown in SEQ ID NO: 97, or the sequence obtained by deleting any 1 or any 2-4 consecutive bases in the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); said V is any one of bases A, G, and C; N is any one of bases A, T, C, and G.
  • a kit for constructing a library at the 3' end of mRNA comprising the reverse transcription primer described in any one of items 6-7.
  • kit according to item 8 further comprising a two-strand synthetic primer, the sequence of which is the universal sequence P5 linker-(N) n ;
  • the universal sequence P5 linker is the sequence shown in SEQ ID NO: 98, or the sequence obtained by deleting any 1 or any 2-6 consecutive bases in the sequence shown in SEQ ID NO: 98; Said N is any one of bases A, T, C and G, and n is any integer of 4-10 (preferably 6-9).
  • SiPAS V2 The process of SiPAS V2 is simplified and the cost is low.
  • SiPAS V2 is optimized for Illumina (PE150) standard sequencing platform. Benefiting from the simplified and standardized library construction process, the labor cost and reagent cost of SiPAS V2 are greatly reduced.
  • SiPAS V2 is very effective in quantifying gene expression.
  • the reads used for the alignment achieve higher base quality, thereby improving the sensitivity of the reads alignment, and the high accuracy and reproducibility of gene expression quantification.
  • SiPAS V2 can eliminate technical duplication when performing large-scale population transcriptome analysis.
  • SiPAS V2 optimizes the library construction process to make it more suitable for paired-end 150bp sequencing, which improves the simplicity of the library construction method and the utilization rate of data. Therefore, SiPAS V2 can more accurately quantify gene expression.
  • SiPAS V2 has a good detection effect on degraded RNA. This is because the RNA 3' end is generally more stable than the RNA's 5' end sequence. High tolerance to RNA degradation reduces gene expression differences caused by RNA integrity and ensures accurate identification of differentially expressed genes between samples.
  • Fig. 1 The experimental design principle of SiPAS V2 of the embodiment of the present invention.
  • (a) The experimental flow of SiPAS V2 of the embodiment of the present invention. 1 Perform cell lysis in a single tube to completely break down the cell wall; 2 Transfer the lysate to a 96-well plate, and then extract total RNA; 3 Use the designed reverse transcription primers containing the barcode tag sequence for mRNA reverse transcription; 4- 8 Combine the samples in the 96-well plate into 1 tube for second strand synthesis, purification of cDNA, size selection, and PCR amplification for sequencing.
  • (b) Design schemes of the embodiment of the present invention and comparative examples 1, 2 and 3. The inventive and comparative examples were intended to evaluate the effect of swapping linker sequences and using UMIs.
  • In Comparative Example 1, the barcode is attached to the P5 adapter and no UMI is used.
  • In the example of the present invention, the poly(T) end is attached to the P7 adapter and no UMI is used.
  • In Comparative Example 2, the poly(T) end is attached to the P5 adapter and a UMI is used.
  • the optimized design of SiPAS V2 can be obtained through the comparison of 4 tests.
  • Illumina paired-end sequencing R1-end reads are joined to the P5 adapter and R2-end reads are joined to the P7 adapter.
  • Figure 2 simulates the accuracy and sensitivity of the alignment of reads of different lengths.
  • Fig. 3 The reads alignment results of the single-end and paired-end alignment modes of Examples and Comparative Implementations 1, 2, and 3 of the present invention.
  • Figure 4 Effect of read length on alignment.
  • Figure 5 Effect of UMI on quantification of gene expression.
  • SD standard deviation
  • Fig. 6 Accuracy and repeatability of quantification of gene expression in Examples of the present invention and Comparative Examples.
  • Fig. 7 is a comparison between the embodiment of the present invention and comparative example 4 TruSeq.
  • the embodiment of the present invention and comparative example 4 respectively contain 3 and 12 repetitions under each condition.
  • the library was constructed by the two methods for 3 technical repetitions, and 5M reads were sequenced.
  • Figure 8 The RNA integrity value (Rin) detected by the Agilent 2100 Bioanalyzer system for RNAs with different degrees of degradation.
  • Fig. 9 is the performance of the embodiments of the present invention in detecting degraded RNA.
  • (c) and (d) Correlation of gene expression levels before and after RNA degradation.
  • the library building method described in the present invention comprises the steps:
  • the PCR product was purified with an equal volume of Beckman Agencourt AMPureXP beads to obtain a mixed library of mRNA 3' ends.
  • the library construction method described in the present invention can be referred to in FIG. 1 .
  • Hexaploid wheat cv. Chinese Spring (Triticum aestivum ssp. aestivum) was germinated and cultured in Hoagland medium for 14 days (greenhouse temperature 22 °C, light-dark cycle 16h/8h). At 10:00 a.m. (light conditions) and at 10:00 p.m. (dark treatment conditions), the aerial leaves were taken, quick-frozen in liquid nitrogen and ground; total RNA was extracted with Zymo's Direct-zol TM RNA MiniPrep Plus reagent, and the integrity of the RNA was checked on an Agilent 2100.
  • intact RNA without obvious degradation (Rin value 7.4) was used for the library construction of the examples of the present invention and the comparative examples. For the degradation test, RNA was fragmented using the NEB Fragmentation Kit (E6150S) to Rin values of 6.8 (slightly degraded) and 2.2 (obviously degraded), and the specific operations were performed according to the instructions.
  • Second-strand synthesis primers and reverse transcription primers were synthesized (the synthesis was performed by Invitrogen), and then diluted to 100 μM with DEPC water.
  • the 96 barcode sequences (SEQ ID NO:1-96) in the reverse transcription primer are as follows:
  • the reaction was terminated by adding EDTA until the cDNA reached 50 μM.
  • NEBNext Ultra II Q5 Master Mix (Cat. No.: M0544L)
  • Add the purified product obtained in step (5) to NEBNext Ultra II Q5 Master Mix, 0.5 μM Illumina RP1 primer and 0.5 μM Illumina Index primer (Cat. No.: 15013198)
  • the amplification conditions are: 98°C for 30s; 98°C for 15s; 62°C for 15s; 72°C for 60s, run for 10-12 cycles; 72 °C, 7min; 4°C, keep.
  • the Illumina paired-end sequencing mode sequences both ends of the template DNA fragment and generates two reads, of which read1 (R1) is connected to the Illumina P5 adapter sequence and read2 (R2) is connected to the Illumina P7 adapter sequence.
  • the reverse transcription primer sequence used is GCCTTGGCACCCGAGAATTCCA-(barcode)(T) 21 VN
  • the two-strand synthetic primer is GTTCAGAGTTTCTACAGTCCGACGATCNNNNNN
  • GCCTTGGCACCCGAGAATTCCA SEQ ID NO: 97
  • GTTCAGATTCTACAGTCCGACGATC SEQ ID NO: 98
  • the sequence of the reverse transcription primer is GTTCAGAGTTTCTACAGTCCGACGATC-(barcode)(T) 21 VN
  • the two-strand synthetic primer is GCCTTGGCACCCGAGAATTCCANNNNNN
  • GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTTCTACAGTCCGACGATC are Illumina P7 and P5 sequencing adapter sequences
  • the library construction procedure is the same as the library construction process in the "Materials and Methods" section above.
  • the reverse transcription primer sequence is GTTCAGAGTTTCTACAGTCCGACGATC-(barcode)N 10 V 5 (T) 21 VN
  • the second-strand synthetic primer is GCCTTGGCACCCGAGAATTCCANNNNNN, wherein GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTTCTACAGTCCGACGATC are Illumina P7 and P5 sequencing adapter sequences, and N 10 V 5 is a UMI molecule Tag sequence
  • library construction experiment process is the same as the library construction process in the "Materials and Methods" section above.
  • the reverse transcription primer sequence is GCCTTGGCACCCGAGAATTCCA-(barcode)N 10 V 5 (T) 21 VN
  • the second-strand synthetic primer is GTTCAGAGTTTCTACAGTCCGACGATCNNNNNN
  • GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTTCTACAGTCCGACGATC are Illumina P7 and P5 sequencing adapter sequences
  • N 10 V 5 is a UMI molecule Tag sequence
  • TruSeq Full-Length Transcriptome Library Construction Kit is a commonly used kit for transcriptome library construction. We used this kit, commonly used in the prior art, to construct full-length transcriptome libraries for the processed samples, with 3 technical replicates for each treatment; the specific experimental steps were carried out according to the kit instructions.
  • the quality of the above-mentioned libraries was checked, and once the libraries passed quality control, PE150 paired-end sequencing was performed on the Illumina NovaSeq sequencing platform, with more than 2Gb of sequencing data per library.
  • the raw sequencing data of the library is filtered to remove adapter sequences and low-quality bases. After obtaining the filtered data, we split the sequencing files according to the barcode of each sample, and then use STAR aligner v.2.6.1c (Dobin, A.
  • the length and base quality of sequencing reads are the key to accurate alignment of reads and the basis for accurate quantification of gene expression.
  • To examine how read length affects the alignment accuracy of RNA-seq reads, we simulated a dataset of 100,000 reads from transcript sequences of the wheat reference genome (IWGSC RefSeq v1.0). These simulated reads are of different lengths, ranging from 50bp to 150bp. Comparing the original position and the aligned position of each read showed that alignment accuracy was consistently very high, remaining above 0.999. In contrast, increasing the read length increased the sensitivity of the alignment, from 0.75 to 0.95 (Fig. 2a).
  • RNA control molecule has 92 molecules of known sequence that can be used to compare the accuracy and sensitivity of gene expression detection in RNA-seq experiments.
  • RNA-seq in triplicate on the same leaf sample used in the test using Comparative Example 4 TruSeq.
  • the results show that under the conditions of different sequencing depths, the embodiment of the present invention is better than comparative examples 1, 2 and 3, and shows slightly lower performance than comparative example 4 TruSeq.
  • the difference in the Pearson correlation coefficient between the example of the present invention and the comparative example 4TruSeq is 0.019 on average ( FIG. 6 a ).
  • the performance of the embodiment of the present invention is superior to other testing methods, and achieves high sensitivity, accuracy and repeatability.
  • Differentially expressed gene (DEG) analysis is one of the most common applications of RNA-seq.
  • Both the TruSeq and SiPAS V2 libraries were constructed using wheat leaves sampled at 10 am and 10 pm to identify differentially expressed genes.
  • PCA Principal component analysis
  • Fig. 7c The embodiment of the present invention is highly consistent with comparative example 4 TruSeq.
  • PC1 representing the biological difference between am and pm leaf samples, explained 78% of the total variance.
  • PC2 which represents the technical difference between SiPAS V2 and TruSeq, explained only 18% of the total difference.
  • RNA molecules are sensitive and easy to degrade.
  • Traditional full-length transcriptome detection methods such as TruSeq have very high requirements for RNA integrity, and have poor quantitative effects on degraded RNA genes. Therefore, RNA-seq methods with high tolerance to degraded RNA are favored in high-throughput transcriptomics studies.
  • the integrity of RNA molecules measured by the RNA integrity index value (Rin), reflects the degree of RNA degradation.
  • Rin RNA integrity index value
  • Mg ++ Mg ++ to randomly fragment RNA and simulate the RNA degradation process. The two fragmented samples had Rin values of 6.8 and 2.3, respectively, compared to intact RNA (unfragmented) with a Rin value of 7.4 (Fig. 8).
  • 3' RNA-seq methods, including SiPAS V2, will perform best when the species under study has high-quality transcriptome gene annotation information.
  • SiPAS V2 has the same performance advantages as TruSeq, and the cost of manpower and reagents is significantly reduced (Table 1 and Table 2), and it is expected to be widely used in large-scale population transcriptome research.
  • Table 1 The library construction cost and process of different library construction methods
  • SiPAS V2 not only has a simplified process and low cost, but also achieves high sensitivity, high accuracy and reproducibility in the quantification of gene expression in complex genomes. Furthermore, SiPAS V2 exhibited remarkable resistance to RNA degradation. These advantages ensure the applicability of SiPAS V2 in large-scale population transcriptomic studies. The application of SiPAS V2 in multiple species will help us deeply understand the mysteries of biological genomics.

Landscapes

  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biochemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Microbiology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Physics & Mathematics (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Immunology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • General Chemical & Material Sciences (AREA)
  • Medicinal Chemistry (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

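The present invention provides an efficient 3' RNA-seq method, namely simplified poly(A)-anchored sequencing (SiPAS V2). The method specifically swaps the next-generation sequencing adapters in the library so that read R1 covers the non-poly(T) end of the library, making it better suited to the standard PE150 sequencing format. By evaluating the overall performance of SiPAS V2 in hexaploid wheat, we demonstrate that SiPAS V2 quantifies gene expression with high sensitivity, accuracy, and reproducibility.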

Description

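A high-quality 3' RNA-seq library construction method and use thereof
Technical Field
The present invention relates to a high-quality 3' RNA-seq library construction method and its use, which can be applied to the accurate measurement of gene expression levels in large-scale, high-throughput population studies.
Background Art
RNA sequencing (RNA-seq) is a key technology in modern biological research. It has moved the study of many species from the level of a single genome to a multidimensional omics level, and has thereby improved our understanding of genome biology. Over the past few years, whole-genome sequencing has been completed for many crops, producing large amounts of genomic data, for example genetic variation maps of wheat, maize, rice, cassava, potato, and soybean; high-quality pan-genomes have also been assembled for many important crops. This wealth of genomic data has left a research gap that must be filled with large amounts of transcriptome data to help decode genome function. Efficient RNA-seq technology is therefore becoming increasingly important for biological research.
The emergence of 3' RNA-seq was a major leap for RNA-seq technology. Although 3' RNA-seq, unlike conventional RNA-seq, cannot detect alternative splicing events, it offers low cost, high efficiency, and accurate quantification of gene expression. In recent years, researchers have actively explored and developed 3' RNA-seq technology. The main improvements include using sample barcodes to raise library construction throughput, simplifying library preparation to further reduce cost, and using unique molecular identifiers (UMIs) to improve the accuracy of gene expression quantification. Although these studies have been very successful, none of these 3' RNA-seq methods has been optimized for standard high-throughput paired-end 150/250 bp (PE150 or PE250) sequencing platforms, and customized sequencing modes (for example, one read shorter than 150 bp) can only be run at laboratory scale. A crucial but often overlooked fact is that more and more sequencing projects are outsourced from research institutions to commercial sequencing companies. At production scale, these companies usually provide services in standard sequencing modes, at greatly reduced cost. For RNA-seq in particular, PE150 or PE250 sequencing can also improve the accuracy of gene expression quantification, because longer reads generally improve alignment precision. In other words, there is an urgent need for a simplified, accurate, and broadly applicable 3' RNA-seq workflow that runs on standardized, high-throughput, large-scale sequencing platforms.
Here, we combined the strengths of reported 3' RNA library construction methods and optimized them for the standard paired-end 150 bp sequencing mode (PE150), developing an effective gene expression profiling method, namely simplified poly(A)-anchored sequencing (SiPAS V2). Using RNA spike-ins as a control and applying the method to hexaploid bread wheat (Triticum aestivum ssp. aestivum, 2n = 6x = 42, genome size = 16 Gb), our results show that SiPAS V2 detects differentially expressed genes efficiently, stably, and accurately. SiPAS V2 is expected to facilitate population transcriptomics studies in crops and many other plants.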
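Summary of the Invention
Paired-end (PE) sequencing sequences both ends of a DNA template fragment and generates high-quality, alignable sequence data. Paired-end reads are divided, in order of sequencing, into read 1 (R1, attached to the Illumina P5 sequencing adapter) and read 2 (R2, attached to the Illumina P7 sequencing adapter); the current standard mode on Illumina high-throughput platforms is paired-end 150 bp. The Illumina platform requires base incorporation to proceed in step across the molecules of each cluster. Because enzyme activity and other factors decline as the run proceeds, base incorporation within a cluster gradually falls out of step, so base quality is higher in R1 than in R2. In 3' RNA-seq, reading a run of identical bases (such as poly(T)) in R1 makes signal detection difficult and further accelerates this loss of synchrony. Reported 3' RNA-seq methods therefore use custom paired-end sequencing (read lengths: R1 < R2 < 150 bp) to reduce the impact of reading through poly(T) on sequencing quality: R1 (the poly(T) end) sequences only the barcode, and R2 (the non-poly(T) end) is sequenced to the full 150 bp for alignment. Given that commercial high-throughput sequencing currently uses the standard PE150 mode, we further improved 3' RNA-seq library construction in three respects. First, we used PE150 paired-end sequencing to increase read length and tested whether this improves read alignment accuracy and the ability to detect gene expression. Second, we tested swapping the sequencing adapters, so that R1 sequences the non-poly(T) end of the 3' RNA-seq library and R2 sequences the poly(T) end, and analyzed whether the resulting gain in base quality improves read alignment accuracy. Third, because single-cell RNA-seq libraries start from very little RNA, they require more PCR cycles; to correct for the effect of PCR amplification on read quantification, their analysis counts reads using unique molecular identifiers (UMIs). Bulk RNA library construction methods (normal RNA input, for example 0.5 μg or more) derived from this technology retained the UMI tag. The UMI sequence lengthens the primer, raises synthesis cost, and shortens the effective read length, yet its corrective effect on libraries with normal RNA input has not been evaluated in published reports. The present invention tested whether UMIs are of value for 3' RNA-seq with normal RNA input and found that they are not necessary; omitting the UMI sequence from the reverse transcription primer lowers the synthesis cost of each primer by about 150 yuan.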
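To make the UMI comparison concrete, the following is a minimal sketch of the two counting schemes being contrasted: counting every aligned read per gene versus counting distinct UMIs per gene. It is an illustration under stated assumptions, not the analysis pipeline of the invention; the function name `count_reads_and_umis`, the input layout, and the toy gene and UMI values are all hypothetical.

```python
from collections import defaultdict


def count_reads_and_umis(records):
    """Count reads and distinct UMIs per gene.

    `records` is an iterable of (gene_id, umi) pairs, one per aligned read
    (a hypothetical input layout chosen for this sketch). Reads that share
    both gene and UMI are treated as PCR copies of a single molecule.
    """
    read_counts = defaultdict(int)
    umis_seen = defaultdict(set)
    for gene_id, umi in records:
        read_counts[gene_id] += 1
        umis_seen[gene_id].add(umi)
    umi_counts = {gene: len(umis) for gene, umis in umis_seen.items()}
    return dict(read_counts), umi_counts


# Hypothetical toy input: one geneA molecule was amplified three times.
toy_records = [
    ("geneA", "ACGTACGTAC"),
    ("geneA", "ACGTACGTAC"),
    ("geneA", "ACGTACGTAC"),
    ("geneA", "TTGCATTGCA"),
    ("geneB", "GGATCCGGAT"),
]
reads, umis = count_reads_and_umis(toy_records)
```

When the two counts track each other closely across genes, as the document reports for normal RNA input, the UMI adds primer length and cost without changing the quantification.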
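Based on these hypotheses, and drawing on the technical strengths of reported 3' RNA-seq methods, we carried out simulation analyses and library construction tests and established SiPAS V2, an accurate and efficient library construction method.
Specifically, the present invention swaps the sequencing adapters during library construction: the P5 adapter is attached to the non-poly(T) end and the P7 adapter to the poly(T) end, so that the non-poly(T) end is sequenced first and the poly(T) end second. This makes the library better suited to paired-end 150 bp sequencing, simplifies library construction, improves data utilization, and allows more accurate quantification of gene expression.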
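One practical consequence of attaching P7 to the poly(T) end is that sample identity would be read from R2, while R1 carries transcript sequence for alignment. The sketch below assumes that R2 begins with the sample barcode followed directly by the poly(T) run from the reverse transcription primer, and that all barcodes have the same length. The function name `split_r2`, the barcode table, and the `min_poly_t` threshold are hypothetical choices for illustration and do not describe the document's actual demultiplexing software.

```python
def split_r2(r2_seq, barcodes, min_poly_t=12):
    """Split an R2 read into (sample, remainder) using its leading barcode.

    Assumes R2 starts with the sample barcode, followed by the poly(T) run
    contributed by the reverse transcription primer. Returns None when the
    barcode is unknown or the poly(T) run is shorter than `min_poly_t`.
    """
    for barcode, sample in barcodes.items():
        if r2_seq.startswith(barcode):
            rest = r2_seq[len(barcode):]
            stripped = rest.lstrip("T")  # drop the leading poly(T) run
            if len(rest) - len(stripped) < min_poly_t:
                return None
            return sample, stripped
    return None


# Hypothetical 6-base barcodes; the real barcodes are SEQ ID NO: 1-96.
example_barcodes = {"ACGTAC": "sample01", "TGCAGT": "sample02"}
example_r2 = "ACGTAC" + "T" * 21 + "GCATTAGC"
assigned = split_r2(example_r2, example_barcodes)
```

Here `assigned` holds the sample name and the sequence that follows the poly(T) run; reads with an unrecognized barcode or a short T run are rejected.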
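In a specific embodiment of the present invention, the adapter swap is achieved by modifying the reverse transcription primer. In a specific embodiment, the sequence of the reverse transcription primer comprises universal P7 adapter sequence-(barcode)(T)nVN, wherein the universal P7 adapter sequence is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97. The universal P7 adapter sequence is most preferably the full sequence shown in SEQ ID NO: 97, i.e. 22 bases. A partial sequence of the adapter (i.e. with several bases deleted) can still support reverse transcription, but shortening the universal adapter sequence reduces the number of complementary base pairs formed when the PCR primer anneals to the reverse transcription product during PCR amplification, which lowers PCR efficiency. Below 18 bases, the primer has difficulty pairing with the universal adapter sequence during annealing, and library amplification cannot be completed. In the present invention, the universal P7 adapter sequence may therefore be an 18-22 base portion of the sequence shown in SEQ ID NO: 97, which also supports reverse transcription.
In a specific embodiment of the present invention, n in the reverse transcription primer is any integer from 12 to 35, preferably 21. Reverse transcription can be performed with any poly(T) length of 12-35 bases. Commonly used reverse transcription primers have a minimum T-run length of 12 bases; with a short poly(T), mis-primed reverse transcription readily occurs at internal homopolymer runs within the mRNA, and increasing the poly(T) length effectively reduces such internal priming. Above 35 bases, however, primer binding slows and reverse transcription efficiency falls; a longer poly(T) also raises primer synthesis cost. In the reverse transcription primer of the present invention, n is therefore any integer from 12 to 35, i.e. reverse transcription can be performed with a poly(T) length of 12-35 bases, and n is most preferably 21.
In a specific embodiment of the present invention, the universal P5 adapter sequence in the second-strand synthesis primer is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98. The universal P5 adapter sequence is most preferably the full sequence shown in SEQ ID NO: 98, i.e. 26 bases. A partial sequence of the adapter (i.e. with several bases deleted) can still support second-strand synthesis, but shortening the universal P5 adapter sequence reduces the number of complementary base pairs formed when the PCR primer anneals to its template during PCR amplification, which lowers PCR efficiency. Below 20 bases, the primer has difficulty pairing with the universal adapter sequence during annealing, and library amplification cannot be completed. In the present invention, the universal P5 adapter sequence may therefore be a 20-26 base portion of the sequence shown in SEQ ID NO: 98, which also supports second-strand synthesis.
In a specific embodiment of the present invention, n in the second-strand synthesis primer is any integer from 4 to 10, preferably 6-9. With fewer than 4 degenerate bases N, pairing between the primer and the template cDNA is unstable; with more than 10, annealing efficiency between primer and template decreases and primer synthesis cost increases. In the present invention, 4-10 degenerate N bases therefore all achieve the described effect, and 6-9 are preferred.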
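The primer layouts and length limits described above can be written down as a small sketch that assembles each primer as an IUPAC string (V = A, C, or G; N = any base) and rejects values outside the stated ranges. The 22-base P7 universal sequence is the one quoted in this document; the function names, the example barcode, and the placeholder P5 string are hypothetical and serve only to exercise the checks.

```python
# 22-base universal P7 sequence as quoted in this document (SEQ ID NO: 97).
P7_UNIVERSAL = "GCCTTGGCACCCGAGAATTCCA"


def rt_primer(barcode, n_t=21, p7=P7_UNIVERSAL):
    """Assemble the reverse transcription primer P7-(barcode)(T)nVN."""
    if not 4 <= len(barcode) <= 12:
        raise ValueError("barcode must be 4-12 bases long")
    if set(barcode) - set("ACGT"):
        raise ValueError("barcode must contain only A, C, G, T")
    if not 12 <= n_t <= 35:
        raise ValueError("poly(T) length n must be 12-35")
    if not 18 <= len(p7) <= 22:
        raise ValueError("P7 universal sequence must be 18-22 bases")
    return p7 + barcode + "T" * n_t + "VN"


def second_strand_primer(p5, n_degenerate=6):
    """Assemble the second-strand synthesis primer P5-(N)n."""
    if not 20 <= len(p5) <= 26:
        raise ValueError("P5 universal sequence must be 20-26 bases")
    if not 4 <= n_degenerate <= 10:
        raise ValueError("number of degenerate bases n must be 4-10")
    return p5 + "N" * n_degenerate


# Hypothetical barcode; the real barcodes are SEQ ID NO: 1-96.
example_rt = rt_primer("ACGTAC")
# Placeholder 26-mer standing in for SEQ ID NO: 98, used only for the length check.
placeholder_p5 = "ACGT" * 6 + "AC"
example_ss = second_strand_primer(placeholder_p5)
```

With the preferred values (22-base P7, 21 T bases) and a 6-base barcode, the reverse transcription primer is 51 bases long, which shows how a UMI segment would add directly to primer length.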
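Specifically, the present invention provides the following technical solutions:
1. A 3' RNA-seq library construction method, characterized in that the sequencing adapters are swapped during library construction, specifically by attaching the P5 adapter to the non-poly(T) end and the P7 adapter to the poly(T) end.
2. The library construction method according to item 1, wherein the sequencing adapters are swapped by using a reverse transcription primer and a second-strand synthesis primer, wherein the sequence of the reverse transcription primer comprises universal P7 adapter sequence-(barcode)(T)nVN;
wherein the universal P7 adapter sequence is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); V is any one of the bases A, G, and C; and N is any one of the bases A, T, C, and G.
3. The library construction method according to item 2, wherein the barcode sequence is a nucleotide sequence 4-12 bases long; preferably, the barcode sequence is selected from any one of SEQ ID NO: 1-96.
4. The library construction method according to item 2, wherein the sequence of the second-strand synthesis primer is universal P5 adapter sequence-(N)n;
wherein the universal P5 adapter sequence is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98; N is any one of the bases A, T, C, and G, and n is any integer from 4 to 10 (preferably 6-9).
5. The library construction method according to item 1, wherein the method comprises the following steps:
reverse transcribing total RNA using the reverse transcription primer;
pooling the reverse-transcribed samples into one tube and then degrading the template mRNA to obtain the reverse transcription product;
purifying the reverse transcription product and, after purification, adding the second-strand synthesis primer for second-strand synthesis;
performing library fragment size selection to obtain library template DNA;
performing PCR amplification to enrich the library template DNA;
purifying the PCR product to obtain the mRNA 3' end library.
6. A reverse transcription primer, the sequence of which comprises universal P7 adapter sequence-(barcode)(T)nVN;
wherein the universal P7 adapter sequence is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); V is any one of the bases A, G, and C; and N is any one of the bases A, T, C, and G.
7. The reverse transcription primer according to item 6, wherein the barcode sequence is a nucleotide sequence 4-12 bases long; preferably, the barcode sequence is selected from any one of SEQ ID NO: 1-96.
8. A kit for constructing an mRNA 3' end library, comprising the reverse transcription primer according to any one of items 6-7.
9. The kit according to item 8, further comprising a second-strand synthesis primer, the sequence of which is universal P5 adapter sequence-(N)n;
wherein the universal P5 adapter sequence is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98; N is any one of the bases A, T, C, and G, and n is any integer from 4 to 10 (preferably 6-9).
10. Use of the library construction method according to items 1-5, the reverse transcription primer according to items 6-7, or the kit according to items 8-9 in pooled construction of mRNA 3' end libraries.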
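Using the library construction method provides the following beneficial technical effects:
(1) SiPAS V2 has a simplified workflow and low cost. SiPAS V2 is optimized for, and well suited to, the standard Illumina (PE150) sequencing platform. Thanks to the simplified and standardized library construction workflow, the labor and reagent costs of SiPAS V2 are greatly reduced.
(2) SiPAS V2 is highly effective at quantifying gene expression. By swapping the P5 and P7 adapter sequences, the reads used for alignment attain higher base quality, which improves read alignment sensitivity and yields highly accurate and reproducible gene expression quantification. Notably, for the 107,891 genes in the wheat genome, only 5 million reads are needed for the Pearson correlation coefficient between the gene expression levels of two technical replicates to reach 0.96. This indicates that technical replicates can be dispensed with when SiPAS V2 is used for large-scale population transcriptome analysis. SiPAS V2 optimizes the library construction workflow for paired-end 150 bp sequencing, simplifying library construction and improving data utilization, and can therefore quantify gene expression more accurately.
(3) SiPAS V2 performs well on degraded RNA. This is because the 3' end of an RNA is generally more stable than its 5' end. High tolerance to RNA degradation reduces the differences in measured gene expression caused by variation in RNA integrity and ensures accurate identification of differentially expressed genes between samples.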
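The reproducibility figure quoted in effect (2) is a Pearson correlation between the expression levels of two technical replicates. As a rough sketch of how such a value could be computed, the code below normalizes raw counts to counts per million (CPM) and correlates the log-transformed values. The log2(CPM + 1) transform, the function names, and the toy counts are assumptions made for illustration; the document does not specify the exact transform used for this comparison.

```python
import math


def cpm(counts):
    """Scale raw per-gene read counts to counts per million."""
    total = sum(counts)
    return [c * 1_000_000 / total for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_correlation(counts_a, counts_b):
    """Correlate two replicates on log2(CPM + 1); the transform is an assumption."""
    log_a = [math.log2(v + 1) for v in cpm(counts_a)]
    log_b = [math.log2(v + 1) for v in cpm(counts_b)]
    return pearson(log_a, log_b)


# Hypothetical counts for five genes; replicate B was sequenced twice as deeply.
rep_a = [10, 200, 0, 35, 990]
rep_b = [20, 400, 0, 70, 1980]
r = replicate_correlation(rep_a, rep_b)
```

Because CPM removes the difference in sequencing depth, the two toy replicates correlate perfectly; real replicates differ by sampling noise, which is what the reported coefficient measures.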
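Brief Description of the Drawings
Figure 1. Experimental design of SiPAS V2, the example of the present invention. (a) Experimental workflow of SiPAS V2. ① Cells are lysed in single tubes to completely break down the cell wall; ② the lysates are transferred to a 96-well plate and total RNA is extracted; ③ mRNA is reverse transcribed using the designed reverse transcription primers carrying barcode tag sequences; ④-⑧ the samples in the 96-well plate are pooled into one tube for second-strand synthesis, cDNA purification, size selection, and PCR amplification for sequencing. (b) Designs of the example of the present invention and Comparative Examples 1, 2, and 3. The example and the comparative examples were designed to evaluate the effects of swapping the adapter sequences and of using UMIs. In Comparative Example 1, the barcode is attached to the P5 adapter and no UMI is used. In the example of the present invention, the poly(T) end is attached to the P7 adapter and no UMI is used. In Comparative Example 2, the poly(T) end is attached to the P5 adapter and a UMI is used. The optimized design of SiPAS V2 is derived by comparing the 4 tests. In Illumina paired-end sequencing, R1 reads are attached to the P5 adapter and R2 reads to the P7 adapter.
Figure 2. Precision and sensitivity of alignment for simulated reads of different lengths. (a) Precision and sensitivity for simulated data with different read lengths. Points show means, and bars around the points show the standard deviation (SD) of 100 replicates. Point size corresponds to read length. (b) Quality values of the simulated reads. A quadratic function was used to model the per-base quality values of a read. By varying the quadratic coefficient, reads of different base quality (from 25 to 37) can be generated. (c) Precision and sensitivity of alignment for reads with different quality values. Point size corresponds to read quality value, and bars around the points show the SD of 100 replicates. (d) Quality scores at the poly(T) end (dashed lines) and non-poly(T) end (solid lines) of reads in the 4 examples. Shading indicates the 95% confidence interval.
Figure 3. Read alignment results in single-end and paired-end alignment modes for the example of the present invention and Comparative Examples 1, 2, and 3.
Figure 4. Effect of read length on alignment. (a) Precision and sensitivity of read alignment in single-end and paired-end sequencing modes. For each mode, 101 points are plotted (50 bp to 150 bp for single-end alignment and 200 bp to 300 bp for paired-end alignment). Horizontal and vertical bars show the standard deviation (SD) of sensitivity and precision, respectively. (b) Distribution of effective read lengths for the example of the present invention and Comparative Examples 1-3. Gray boxes show the first quartile, median, and third quartile. Black points with lines show the mean and SD for each example. The UMI sequences in Comparative Examples 2 and 3 were removed.
Figure 5. Effect of UMIs on gene expression quantification. (a) and (b) Evaluation of UMI correction of expressed-gene counts in RNA-seq for Comparative Examples 2 and 3. Both counts were incremented by 1 and log-transformed. (c) and (d) Comparison of read counts and UMI counts for gene expression detection at different expression levels. Open circles show the mean number of expressed genes detected, and the lines above and below the circles show the SD of the gene number.
Figure 6. Accuracy and reproducibility of gene expression quantification for the example of the present invention and the comparative examples. (a) Pearson correlation coefficients between gene expression levels and the known concentrations of ERCC control transcripts at different sequencing data volumes (CPM for the example and Comparative Examples 1-3; TPM for Comparative Example 4, TruSeq). (b) Pearson correlation coefficients of wheat gene expression levels between technical replicates of the different library construction methods at different sequencing data volumes.
Figure 7. Comparison of the example of the present invention with Comparative Example 4 (TruSeq). (a) Correlation of gene expression levels between the example and Comparative Example 4 at different sequencing depths (1M to 12M). (b) Correlation of gene expression levels between the example and Comparative Example 4 at 5M reads. (c) PCA plot of the 10 a.m. and 10 p.m. samples constructed with the example and Comparative Example 4 at 5M reads. The example and Comparative Example 4 comprise 3 and 12 replicates per condition, respectively. (d) Comparison of differentially expressed gene detection between the example and Comparative Example 4, with q value < 0.05 and |fold change| > 2. Libraries were constructed by both methods in 3 technical replicates and sequenced to 5M reads.
Figure 8. RNA integrity values (Rin) of RNA with different degrees of degradation, measured with the Agilent 2100 Bioanalyzer system.
Figure 9. Performance of the example of the present invention in detecting degraded RNA. (a) and (b) Correlation of gene expression levels between technical replicates of degraded-RNA libraries. (c) and (d) Correlation of gene expression levels before and after RNA degradation.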
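Detailed Description
To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific examples and the accompanying drawings.
Unless otherwise specified, the methods used in the following examples are conventional methods, and the reagents used are commercially available.
The library construction method of the invention comprises the following steps:
reverse transcribing total RNA using the reverse transcription primer of the invention;
pooling the 96 reverse-transcribed samples into one tube and then degrading the template mRNA to obtain the reverse transcription product;
purifying the reverse transcription product and, after purification, adding the second-strand synthesis primer for second-strand synthesis;
performing library fragment size selection to recover fragments of 150-600 bp;
performing PCR amplification to enrich the template DNA;
purifying the PCR product with an equal volume of Beckman Agencourt AMPure XP beads to obtain the pooled mRNA 3' end library.
The library construction method of the invention is illustrated in Figure 1.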
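Materials and Methods
Hexaploid wheat cultivar Chinese Spring (Triticum aestivum ssp. aestivum) was germinated and grown in Hoagland nutrient solution for 14 days (greenhouse temperature 22 °C, 16 h/8 h light/dark cycle). Aboveground leaves were collected at 10:00 am (light condition) and 10:00 pm (dark condition), flash-frozen in liquid nitrogen and ground, and total RNA was extracted with the Zymo Direct-zol™ RNA MiniPrep Plus kit. RNA integrity was assessed on an Agilent 2100, and intact RNA without obvious degradation (RIN 7.4) was used for library construction in the example of the invention and the comparative examples. RNA for the degradation test was fragmented with the NEB fragmentation kit (E6150S) to RIN values of 6.8 (slight degradation) and 2.2 (obvious degradation), following the manufacturer's instructions.
RNA-seq 3' end library construction workflow:
The second-strand synthesis primer and the reverse transcription primers were synthesized (by Invitrogen) and then diluted to 100 μM with DEPC-treated water. The 96 barcode sequences (SEQ ID NO: 1-96) in the reverse transcription primers are as follows: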
Figure PCTCN2021117183-appb-000001
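Library construction workflow of the invention:
Starting from total RNA, reverse transcription and library construction are performed with the improved reverse transcription primers of the invention, as follows:
(1) Reverse transcription:
Starting from total RNA, without isolating mRNA, the following operations are performed so that the reverse transcription primer anneals to the mRNA:
In an RNase/DNase-free 200 μl PCR tube, add 3 μl of 100 μM reverse transcription primer, 5 μl of 200 ng/μl total RNA and 2 μl of water; mix, spin down, place in a thermal cycler at 94 °C for 2 min, then immediately place on ice and spin down.
Add the following reagents to reverse transcribe the mRNA: 0.5 mM dNTP, 10 mM DTT and 35.8 U ProtoScript II Reverse Transcriptase (cat. no. E6560L). Mix gently, spin down, and run in a thermal cycler at 25 °C for 5 min and 42 °C for 1 h. The cDNA can be stored at -20 °C.
(2) Degradation of the template mRNA:
Add 1 μl of 4× Exonuclease I (cat. no. M0293L) and run in a thermal cycler at 25 °C for 1 h;
add 20 μl of a 1:1 (v/v) mixture of NaOH (1 M) and EDTA (0.5 M) and run in a thermal cycler at 65 °C for 15 min;
neutralize with 6 M hydrochloric acid.
(3) Purify with the QIAGEN MinElute PCR Purification Kit (cat. no. 28004) according to the product instructions, and elute with 16 μl of ultrapure water.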
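(4) Synthesis of the complementary strand of the cDNA:
Add 1 μl of 10 mM dNTP (cat. no. N0447L) and 5 μl of 100 μM second-strand synthesis primer, run in a thermal cycler at 70 °C for 2 min, then immediately place on ice for 5 min;
add 1 μl of Klenow large fragment DNA polymerase (cat. no. M0210L) and run in a thermal cycler at 37 °C for 30 min;
terminate the reaction by adding EDTA until the cDNA is at 50 μM.
(5) Fragment selection:
Purify the product with Beckman Agencourt AMPure XP beads (cat. no. A63881), performing fragment selection with 0.6 volumes and 0.2 volumes of beads, respectively, according to the product instructions. Finally, dissolve in 20 μl of ultrapure water and collect the supernatant to obtain the library template DNA.
(6) PCR amplification:
Prepare the reaction according to the instructions for NEBNext Ultra II Q5 Master Mix (cat. no. M0544L): add the purified product from step (5) to a reaction containing NEBNext Ultra II Q5 Master Mix, 0.5 μM Illumina RP1 primer and 0.5 μM Illumina Index primer (cat. no. 15013198), and amplify in a thermal cycler under the following conditions: 98 °C for 30 s; 10-12 cycles of 98 °C for 15 s, 62 °C for 15 s and 72 °C for 60 s; 72 °C for 7 min; hold at 4 °C.
(7) PCR product purification:
Purify the PCR product with an equal volume of Beckman Agencourt AMPure XP beads (cat. no. A63881) according to the product instructions. Finally, dissolve in 22 μl of ultrapure water and collect the supernatant to obtain the library template DNA.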
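The quantities implied by the protocol above can be checked with a few lines of arithmetic. The Python sketch below is only an illustration with hypothetical helper names; it uses the volumes, concentrations and hold times stated in steps (1) and (6), and the PCR duration counts programmed hold times only (ramping and the final 4 °C hold are ignored).

```python
def pmol(volume_ul: float, conc_um: float) -> float:
    """Amount of oligo in pmol (1 uM is 1 pmol per ul)."""
    return volume_ul * conc_um


def pcr_hold_seconds(cycles: int) -> int:
    """Sum of programmed hold times for the step (6) PCR program."""
    initial_denaturation = 30      # 98 C for 30 s
    per_cycle = 15 + 15 + 60       # 98 C 15 s, 62 C 15 s, 72 C 60 s
    final_extension = 7 * 60       # 72 C for 7 min
    return initial_denaturation + cycles * per_cycle + final_extension


rt_primer_pmol = pmol(3, 100)      # 3 ul of 100 uM RT primer
total_rna_ng = 5 * 200             # 5 ul of 200 ng/ul total RNA
annealing_volume_ul = 3 + 5 + 2    # primer + RNA + water

print(rt_primer_pmol, total_rna_ng, annealing_volume_ul)
print(pcr_hold_seconds(10) / 60, pcr_hold_seconds(12) / 60)
```

Under these figures each sample contributes 1 μg of total RNA and 300 pmol of reverse transcription primer in a 10 μl annealing mix, and the PCR program holds for 22.5 to 25.5 minutes for 10 to 12 cycles.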
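Example of the Invention
Illumina paired-end sequencing sequences both ends of a template DNA fragment and produces two reads: the read attached to the Illumina P5 adapter sequence is read 1 (R1), and the read attached to the Illumina P7 adapter sequence is read 2 (R2). In this example, the reverse transcription primer sequence is GCCTTGGCACCCGAGAATTCCA-(barcode)(T)21VN and the second-strand synthesis primer is GTTCAGAGTTCTACAGTCCGACGATCNNNNNN, where GCCTTGGCACCCGAGAATTCCA (SEQ ID NO: 97) and GTTCAGAGTTCTACAGTCCGACGATC (SEQ ID NO: 98) are the Illumina P7 and P5 sequencing adapter sequences, respectively. The library construction procedure is the same as that described in the "Materials and Methods" section above.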
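To make the primer design of this example concrete, the following Python sketch assembles the two primers and counts how many distinct oligonucleotides each degenerate sequence represents. It is an illustration under stated assumptions: the barcode `ACGTACGT` is hypothetical (the actual SEQ ID NO: 1-96 barcodes are given only in the figure), and the helper names are not from the source.

```python
from itertools import product

P7_ADAPTER = "GCCTTGGCACCCGAGAATTCCA"      # SEQ ID NO: 97
P5_ADAPTER = "GTTCAGAGTTCTACAGTCCGACGATC"  # SEQ ID NO: 98

# Degenerate symbols used in the primers: V = A/C/G, N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "V": "ACG", "N": "ACGT"}


def rt_primer(barcode: str, n_t: int = 21) -> str:
    """P7 adapter-(barcode)(T)nVN, with n = 21 as in this example."""
    return P7_ADAPTER + barcode + "T" * n_t + "VN"


def second_strand_primer(n_random: int = 6) -> str:
    """P5 adapter followed by n random bases (6 in this example)."""
    return P5_ADAPTER + "N" * n_random


def n_variants(seq: str) -> int:
    """Number of concrete sequences a degenerate primer stands for."""
    count = 1
    for base in seq:
        count *= len(IUPAC[base])
    return count


def expand(seq: str) -> list[str]:
    """Enumerate every concrete sequence (only sensible for few variants)."""
    return ["".join(p) for p in product(*(IUPAC[b] for b in seq))]


primer = rt_primer("ACGTACGT")  # hypothetical barcode
print(primer, len(primer))
print(n_variants(primer), n_variants(second_strand_primer()))
```

The VN anchor gives 3 × 4 = 12 variants per barcoded reverse transcription primer, while the six random bases of the second-strand primer give 4^6 = 4096 variants.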
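Comparative Example 1
The reverse transcription primer sequence is GTTCAGAGTTCTACAGTCCGACGATC-(barcode)(T)21VN and the second-strand synthesis primer is GCCTTGGCACCCGAGAATTCCANNNNNN, where GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTCTACAGTCCGACGATC are the Illumina P7 and P5 sequencing adapter sequences, respectively. The library construction procedure is the same as that described in the "Materials and Methods" section above.
Comparative Example 2
The reverse transcription primer sequence is GTTCAGAGTTCTACAGTCCGACGATC-(barcode)N10V5(T)21VN and the second-strand synthesis primer is GCCTTGGCACCCGAGAATTCCANNNNNN, where GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTCTACAGTCCGACGATC are the Illumina P7 and P5 sequencing adapter sequences, respectively, and N10V5 is the UMI molecular tag sequence. The library construction procedure is the same as that described in the "Materials and Methods" section above.
Comparative Example 3
The reverse transcription primer sequence is GCCTTGGCACCCGAGAATTCCA-(barcode)N10V5(T)21VN and the second-strand synthesis primer is GTTCAGAGTTCTACAGTCCGACGATCNNNNNN, where GCCTTGGCACCCGAGAATTCCA and GTTCAGAGTTCTACAGTCCGACGATC are the Illumina P7 and P5 sequencing adapter sequences, respectively, and N10V5 is the UMI molecular tag sequence. The library construction procedure is the same as that described in the "Materials and Methods" section above.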
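Comparative Example 4
Illumina's TruSeq full-length transcriptome library construction kit is a commonly used kit for transcriptome library construction. We used this kit, which is common in the prior art, to construct full-length transcriptome libraries of the treated samples, with 3 technical replicates per treatment, following the kit instructions.
The above libraries were quality-checked, and libraries that passed were sequenced in PE150 paired-end mode on the Illumina NovaSeq platform, with more than 2 Gb of data per library. The raw data were filtered to remove adapter sequences and low-quality bases, and the filtered sequencing files were split by the barcode of each sample. Reads were then aligned to the wheat reference genome IWGSC 1.0 (International Wheat Genome Sequencing Consortium (IWGSC) et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science 361, eaar7191 (2018)) using STAR aligner v2.6.1c (Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013)). After the aligned BAM files were obtained, HTSeq was used to quantify gene expression for the subsequent evaluation.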
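Splitting pooled reads by sample barcode can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes that, in the design of the invention, the barcode occupies the first bases of R2 (the read attached to the P7 adapter), that all barcodes have the same length, and that only exact matches are accepted. The barcodes and sample names are hypothetical.

```python
from collections import defaultdict


def demultiplex(read_pairs, barcodes):
    """Group R1 sequences by the sample barcode found at the start of R2.

    read_pairs: iterable of (r1_seq, r2_seq) tuples.
    barcodes:   dict mapping barcode sequence -> sample name
                (all barcodes are assumed to have the same length).
    """
    length = len(next(iter(barcodes)))
    by_sample = defaultdict(list)
    for r1, r2 in read_pairs:
        sample = barcodes.get(r2[:length], "unassigned")
        by_sample[sample].append(r1)
    return dict(by_sample)


# Hypothetical barcodes and toy reads.
barcodes = {"ACGTACGT": "sample_01", "TGCATGCA": "sample_02"}
pairs = [
    ("AAAA", "ACGTACGT" + "T" * 21 + "GC"),
    ("CCCC", "TGCATGCA" + "T" * 21 + "AG"),
    ("GGGG", "NNNNNNNN" + "T" * 21 + "CA"),
]
print(demultiplex(pairs, barcodes))
```

A production pipeline would normally stream FASTQ records and may tolerate mismatches in the barcode; both are left out here for brevity.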
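Results and Analysis
1. Simulation analysis of sequencing read alignment
Read length and base quality are key to accurate read alignment and underlie accurate gene expression quantification. To examine how read length affects the alignment accuracy of RNA-seq reads, we simulated a dataset of 100,000 reads from the transcript sequences of the wheat reference genome (IWGSC RefSeq v1.0). The simulated reads varied in length from 50 bp to 150 bp. Comparing the original and aligned positions of individual reads showed that alignment precision was consistently very high, above 0.999 in all cases. In contrast, increasing read length improved alignment sensitivity from 0.75 to 0.95 (Figure 2a). In addition, we simulated another dataset of 100,000 reads with different base quality values (from 25 to 37) to examine the effect of base quality on read alignment (Figure 2b). The results showed that alignment precision was again high and consistent (> 0.997), whereas alignment sensitivity increased with base quality, ranging from 0.87 to 0.89. The simulation analysis indicates that read length and base quality mainly affect alignment sensitivity, with read length having a greater effect than base quality, and that both have little effect on alignment precision. Further analysis showed that as long as a read aligns uniquely to the genome, alignment precision or specificity is high and almost unaffected by read length or base quality.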
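The two metrics in this simulation can be computed by comparing each read's simulated origin with its aligned position. The sketch below uses one common pair of definitions, which is an assumption because the text does not spell them out: precision is the fraction of uniquely aligned reads placed at their true origin, and sensitivity is the fraction of all simulated reads placed at their true origin. The function name and toy coordinates are hypothetical.

```python
def precision_sensitivity(truth, aligned):
    """Compare simulated origins with aligned positions.

    truth:   dict read_id -> (chromosome, position) for every simulated read.
    aligned: dict read_id -> (chromosome, position) for uniquely aligned reads.
    Returns (precision, sensitivity) under the definitions assumed above.
    """
    correct = sum(1 for rid, loc in aligned.items() if truth.get(rid) == loc)
    precision = correct / len(aligned) if aligned else 0.0
    sensitivity = correct / len(truth) if truth else 0.0
    return precision, sensitivity


# Toy data: 4 simulated reads, 3 uniquely aligned, 2 at the right place.
truth = {"r1": ("1A", 100), "r2": ("1A", 200), "r3": ("1B", 300), "r4": ("1D", 400)}
aligned = {"r1": ("1A", 100), "r2": ("1A", 200), "r3": ("1D", 300)}
print(precision_sensitivity(truth, aligned))
```

With these definitions, reads that fail to align uniquely lower sensitivity but leave precision untouched, which matches the pattern reported above where precision stays near 1 while sensitivity varies with read length.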
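2. Alignment of sequencing reads
The simulation analysis showed that higher base quality improves read alignment sensitivity and increases the number of uniquely aligned reads (Figure 2c). We therefore tested the example of the invention and 3 comparative examples to evaluate how adapter swapping affects read base quality and the number of uniquely aligned reads in a wheat RNA-seq experiment. Wheat leaves collected at 10 am were used for the RNA-seq tests, with 12 technical replicates per test. Because only uniquely aligned reads are used in the downstream gene expression analysis, we consider that, for the same amount of sequencing data, a library construction method with a higher proportion of uniquely aligned reads is more efficient. With adapter swapping, R1 becomes the non-poly(T) end used for read alignment. As expected, the example of the invention and Comparative Example 3, both of which use swapped adapters, had the highest base quality at the non-poly(T) end of the reads (Figure 2d). Single-end alignment (read length 150 bp, 5M reads) showed that, compared with Comparative Examples 1 and 2 without adapter swapping, the proportion of uniquely aligned reads in the example of the invention and Comparative Example 3 increased by 10.37% (Figure 3).
Although adapter swapping improved the base quality of the non-poly(T) end reads in the example of the invention, it is worth noting that the base quality at the poly(T) end decreased (Figure 2d). This is probably the combined result of the lower base quality of R2 relative to R1 on the sequencing platform and the inherently low base quality of poly(T)-end reads. According to the simulation analysis, a low-quality 150 bp R2 sequence may affect read alignment in two ways: on the one hand, low base quality may reduce alignment sensitivity; on the other hand, 300 bp paired-end reads can improve alignment sensitivity (Figure 4a). To evaluate the overall effect of R2, we performed alignment with 5M paired-end reads. The results showed that the proportion of uniquely aligned reads increased in all 3 comparative examples. For the example of the invention and Comparative Example 3, uniquely aligned reads increased by 2.71% and 2.34%, reaching 84.33% and 84.29%, respectively (Figure 3). This is consistent with read length having a greater effect on alignment sensitivity than base quality, as shown by the simulation analysis (Figures 2a and 2c). The percentage of uniquely aligned reads in the example of the invention was slightly higher than in Comparative Example 3, which we speculate is due to the relatively longer effective read length at the poly(T) end (Figure 4b). Given the higher proportion of uniquely aligned reads in paired-end alignment, we used paired-end reads for alignment in the following analyses.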
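3. Gene expression quantification
Accurate and stable gene expression quantification is essential for RNA-seq applications. We examined the effect of UMIs (Unique Molecular Identifiers) on correcting PCR amplification bias in 3' RNA-seq. In addition, we compared the example of the invention and the comparative examples in terms of the accuracy and reproducibility of gene expression quantification.
UMIs were attached to the RNA molecules in Comparative Examples 2 and 3, and we evaluated their effectiveness by comparing read counts with UMI counts. Analysis of the 12 technical replicates of each example showed that the mean Pearson correlation coefficient (r) between read counts and UMI counts was greater than 0.999 in both Comparative Examples 2 and 3. Gene expression levels with and without UMI correction were highly similar (Figures 5a and 5b). Likewise, using read counts or UMI counts to detect the number of expressed genes yielded similar numbers of genes (Figures 5c and 5d). Both lines of evidence indicate that when libraries are constructed from a large amount of input RNA with a low number of PCR cycles (for example, more than 0.5 μg of total RNA per sample and 12 PCR cycles), whether UMIs are used has no appreciable effect on the accuracy of 3' RNA-seq gene quantification.
We used Invitrogen's standard RNA control molecules (ERCC) as the "ground truth" to evaluate the accuracy of gene expression quantification. ERCC comprises 92 molecules of known sequence and can be used to compare the accuracy and sensitivity of gene expression detection in RNA-seq experiments. For comparison, we performed RNA-seq with 3 replicates using Comparative Example 4 (TruSeq) on the same leaf sample used in the tests. The results showed that, at different sequencing depths, the example of the invention outperformed Comparative Examples 1, 2 and 3 and performed slightly below Comparative Example 4 (TruSeq). The difference in Pearson correlation coefficient between the example of the invention and Comparative Example 4 (TruSeq) averaged 0.019 (Figure 6a). In addition to accuracy, we evaluated the reproducibility of the example of the invention and the comparative examples by computing the Pearson correlation coefficient of the expression levels of all wheat genes (n = 107,891) between RNA-seq test replicates. The detection stability of the example of the invention was better than that of Comparative Examples 1, 2 and 3 and slightly lower than that of Comparative Example 4 (TruSeq), with a difference in Pearson correlation coefficient of 0.015 (Figure 6b).
In summary, through adapter swapping, the example of the invention outperformed the other tested designs (Comparative Examples 1-3) and achieved high sensitivity, accuracy and reproducibility.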
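The difference between read counts and UMI counts can be illustrated with a small Python sketch. It is a simplified model with hypothetical names: each uniquely aligned read is reduced to a (gene, UMI) pair, and reads sharing both are treated as PCR duplicates of one molecule. Real deduplication tools typically also use mapping position and tolerate UMI sequencing errors, which is not modeled here.

```python
from collections import Counter


def read_and_umi_counts(records):
    """Return per-gene read counts and per-gene UMI counts.

    records: iterable of (gene_id, umi_sequence), one per uniquely aligned read.
    """
    records = list(records)
    read_counts = Counter(gene for gene, _ in records)
    # Collapse identical (gene, UMI) pairs before counting molecules.
    umi_counts = Counter(gene for gene, _ in set(records))
    return read_counts, umi_counts


# Toy data: geneA has 3 reads from 2 distinct molecules.
records = [
    ("geneA", "AAAC"),
    ("geneA", "AAAC"),  # PCR duplicate of the read above
    ("geneA", "GGGA"),
    ("geneB", "TTTC"),
]
reads, umis = read_and_umi_counts(records)
print(dict(reads), dict(umis))
```

When input RNA is abundant and the PCR cycle number is low, duplicates are rare and the two counts stay nearly proportional, which is consistent with the correlation above 0.999 reported in the text.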
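The reproducibility measure described here is a Pearson correlation between the expression levels of two replicates. The sketch below shows the calculation on counts-per-million (CPM) values in plain Python; it is illustrative only, the gene names and counts are made up, and whether the original analysis log-transformed the values before correlating them is not stated, so no transform is applied.

```python
import math


def cpm(counts):
    """Scale raw per-gene counts to counts per million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy replicates (hypothetical genes and counts).
rep1 = {"g1": 10, "g2": 200, "g3": 35, "g4": 0}
rep2 = {"g1": 12, "g2": 190, "g3": 40, "g4": 1}
genes = sorted(rep1)
cpm1, cpm2 = cpm(rep1), cpm(rep2)
r = pearson([cpm1[g] for g in genes], [cpm2[g] for g in genes])
print(round(r, 4))
```

For the accuracy measure, the same `pearson` function would be applied to the measured expression of the 92 ERCC transcripts and their known concentrations.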
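4. Performance comparison between SiPAS V2 and TruSeq
Because the Illumina TruSeq full-length transcriptome library construction kit has long been regarded as the gold standard for gene expression profiling, we benchmarked the example of the invention against TruSeq. Although the accuracy and stability of the example of the invention were slightly lower than those of Comparative Example 4 (TruSeq) (Figures 6a and 6b), the concordance between the two increased with sequencing depth (Figure 7a). When the number of reads per sample increased from 1M to 12M, the Pearson correlation coefficient of gene expression levels measured by the two methods rose from 0.84 to 0.91 (Figure 7a). Given that accuracy and reproducibility clearly increase with the amount of sequencing data (Figures 6a and 6b), we chose a sequencing depth of 5M reads per sample in wheat to balance the detection performance of the example of the invention against sequencing cost. At this depth, we observed high concordance between the example of the invention and Comparative Example 4 (TruSeq) (Figure 7b).
Differentially expressed gene (DEG) analysis is one of the most common applications of RNA-seq. Both TruSeq and SiPAS V2 libraries were constructed from wheat leaves sampled at 10 am and 10 pm to identify differentially expressed genes. For a fair comparison, we used a sequencing depth of 5M reads per replicate for both Comparative Example 4 (TruSeq) and the example of the invention. Principal component analysis (PCA) of gene expression showed that the technical replicates of the am and pm samples were clearly separated (Figure 7c), and the example of the invention was highly consistent with Comparative Example 4 (TruSeq). Notably, PC1, which represents the biological difference between the am and pm leaf samples, explained 78% of the total variance, whereas PC2, which represents the technical difference between SiPAS V2 and TruSeq, explained only 18% of the total variance. These results indicate that SiPAS V2 is well suited to capturing biological differences in DEG analysis.
Based on 3 replicates of the two RNA-seq methods, we analyzed the genes differentially expressed between the two treatments. Applying the same thresholds, namely a fold change in gene expression (am/pm or pm/am) greater than 2 and a false discovery rate (FDR) below 0.05, the example of the invention and Comparative Example 4 (TruSeq) detected similar numbers of DEGs, 5,940 and 6,588, respectively. The two datasets shared a large number of DEGs (5,340). The Pearson correlation coefficient of the differentially expressed genes identified by the example of the invention and Comparative Example 4 (TruSeq) was as high as 0.95, indicating that the example of the invention has DEG detection ability consistent with TruSeq, the standard method widely used on the market (Figure 7d).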
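The DEG criteria and the overlap figures quoted above can be expressed as a short Python sketch. The function names are hypothetical and the fold change is assumed to be supplied as a positive ratio; the DEG numbers are those reported in the text.

```python
def is_deg(fold_change: float, fdr: float,
           fc_cutoff: float = 2.0, fdr_cutoff: float = 0.05) -> bool:
    """Apply the thresholds from the text: fold change > 2 in either
    direction (am/pm or pm/am) and FDR < 0.05.

    fold_change is a positive ratio such as am/pm, so a ratio below
    1/fc_cutoff counts as a change in the opposite direction.
    """
    changed = fold_change > fc_cutoff or fold_change < 1 / fc_cutoff
    return changed and fdr < fdr_cutoff


def overlap_fractions(n_a: int, n_b: int, n_shared: int):
    """Fraction of each DEG set that is shared with the other."""
    return n_shared / n_a, n_shared / n_b


# Numbers reported above: 5,940 (example of the invention),
# 6,588 (TruSeq) and 5,340 shared DEGs.
frac_invention, frac_truseq = overlap_fractions(5940, 6588, 5340)
print(round(frac_invention, 3), round(frac_truseq, 3))
```

On these figures, roughly 90% of the DEGs called by the example of the invention are also called by TruSeq, and roughly 81% of the TruSeq DEGs are recovered by the example of the invention.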
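5. Ability of SiPAS V2 to profile degraded RNA
RNA molecules are sensitive and degrade easily. Conventional full-length transcriptome methods such as TruSeq place very high demands on RNA integrity and quantify genes poorly from degraded RNA. RNA-seq methods with high tolerance to degraded RNA are therefore favored in high-throughput transcriptomics. The integrity of RNA molecules, measured by the RNA integrity number (RIN), reflects the degree of RNA degradation. To evaluate the tolerance of the example of the invention to degraded RNA, we randomly fragmented RNA with Mg2+ to simulate RNA degradation. Compared with intact RNA with a RIN of 7.4 (not fragmented), the two fragmented samples had RIN values of 6.8 and 2.3, respectively (Figure 8). Gene expression quantification of the degraded samples showed that the example of the invention performs well on degraded RNA: the RIN value had a negligible effect on the stability (Figures 9a and 9b) and accuracy (Figures 9c and 9d) of gene expression profiling with the example of the invention. High tolerance to RNA degradation ensures that the example of the invention can stably detect differential gene expression in high-throughput RNA-seq experiments.
In summary, the above results show that the invention, as an improved 3' RNA-seq method, offers several advantages for advancing plant population transcriptomics. ① SiPAS V2 has a simplified workflow and low cost. SiPAS V2 is optimized for the standard Illumina (PE150) sequencing platform. Thanks to the simplified and standardized library construction workflow (Table 1), the labor and reagent costs of SiPAS V2 are greatly reduced, with a library construction cost of $1.98 (Table 2). ② SiPAS V2 is highly effective at quantifying gene expression. By swapping the P5 and P7 adapters, the reads used for alignment achieve higher base quality, which improves the sensitivity of read alignment as well as the accuracy and reproducibility of gene expression quantification. Notably, for the 107,891 genes in the wheat genome, only 5 million reads are needed for the Pearson correlation coefficient between the gene expression levels of two technical replicates to reach 0.96. This indicates that technical replicates can be omitted when SiPAS V2 is used for large-scale population transcriptome analysis. ③ SiPAS V2 performs well on degraded RNA (Figure 9), because the 3' end of an RNA is generally more stable than its 5' end sequence. High tolerance to RNA degradation reduces the gene expression differences caused by RNA integrity and ensures accurate identification of differentially expressed genes between samples.
We did observe that the performance of SiPAS V2 was slightly lower than that of TruSeq in terms of the accuracy and reproducibility of gene expression quantification (Figures 6a and 6b). This may be because TruSeq has longer effective reads and higher base quality, whereas the barcode sequence and poly(T) reduce the effective read length that SiPAS V2 can use for alignment; in addition, the base quality of the SiPAS V2 R2 reads is reduced by the run of consecutive T bases in poly(T). It is also worth noting that, compared with full-length RNA-seq methods, the accuracy of 3' RNA-seq gene expression quantification is more easily affected by the quality of the gene and transcriptome annotation of the reference genome. However, when the species under study has high-quality transcriptome annotation, 3' RNA-seq methods, including SiPAS V2, will perform at their best. Overall, SiPAS V2 offers performance comparable to TruSeq with significantly lower labor and reagent costs (Tables 1 and 2), and is expected to be widely applied in large-scale population transcriptome studies.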
Table 1. Library construction cost and workflow of different library construction methods
Figure PCTCN2021117183-appb-000002
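*: This step is omitted in this library construction workflow.
**: This step is performed in this library construction workflow.
Table 2. Library construction cost of SiPAS V2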
Figure PCTCN2021117183-appb-000003
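Population transcriptomics has become an important tool for decoding genome function. In this study, we developed an efficient 3' RNA-seq method to facilitate plant population transcriptomics. SiPAS V2 not only has a simplified workflow and low cost, but also achieves high sensitivity, accuracy and reproducibility in gene expression quantification in complex genomes. In addition, SiPAS V2 shows marked tolerance to RNA degradation. These advantages ensure the applicability of SiPAS V2 to large-scale population transcriptomics studies. Applying SiPAS V2 to multiple species will help deepen our understanding of genomics.
The specific examples described above further explain the objectives, technical solutions and beneficial effects of the invention in detail. It should be understood that the above are merely specific examples of the invention and are not intended to limit it; any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.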

Claims (10)

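  1. A 3' RNA-seq library construction method, characterized in that the sequencing adapters are swapped during library construction, specifically by attaching the P5 adapter to the non-poly(T) end and attaching the P7 adapter to the poly(T) end.
  2. The library construction method according to claim 1, wherein the sequencing adapters are swapped by using a reverse transcription primer and a second-strand synthesis primer, wherein the sequence of the reverse transcription primer comprises: universal P7 adapter sequence-(barcode)(T)nVN;
    wherein the universal P7 adapter sequence is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); V is any one of the bases A, G and C; and N is any one of the bases A, T, C and G.
  3. The library construction method according to claim 2, wherein the barcode sequence is a nucleotide sequence 4-12 bases in length; preferably, the barcode sequence is selected from any one of SEQ ID NO: 1-96.
  4. The library construction method according to claim 2, wherein the sequence of the second-strand synthesis primer is: universal P5 adapter sequence-(N)n;
    wherein the universal P5 adapter sequence is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98; N is any one of the bases A, T, C and G; and n is any integer from 4 to 10 (preferably 6-9).
  5. The library construction method according to claim 1, wherein the method comprises the following steps:
    reverse transcribing total RNA using a reverse transcription primer;
    pooling the reverse-transcribed samples into one tube and then degrading the template mRNA to obtain the reverse transcription product;
    purifying the reverse transcription product and, after purification, adding a second-strand synthesis primer for second-strand synthesis;
    performing library fragment size selection to obtain the library template DNA;
    performing PCR amplification to enrich the library template DNA;
    purifying the PCR product to obtain the mRNA 3' end library.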
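  6. A reverse transcription primer, the sequence of which comprises: universal P7 adapter sequence-(barcode)(T)nVN;
    wherein the universal P7 adapter sequence is the sequence shown in SEQ ID NO: 97, or a sequence obtained by deleting any 1 base or any 2-4 consecutive bases from the sequence shown in SEQ ID NO: 97; n is any integer from 12 to 35 (preferably 21); V is any one of the bases A, G and C; and N is any one of the bases A, T, C and G.
  7. The reverse transcription primer according to claim 6, wherein the barcode sequence is a nucleotide sequence 4-12 bases in length; preferably, the barcode sequence is selected from any one of SEQ ID NO: 1-96.
  8. A kit for mRNA 3' end library construction, comprising the reverse transcription primer according to any one of claims 6-7.
  9. The kit according to claim 8, further comprising a second-strand synthesis primer, the sequence of which is: universal P5 adapter sequence-(N)n;
    wherein the universal P5 adapter sequence is the sequence shown in SEQ ID NO: 98, or a sequence obtained by deleting any 1 base or any 2-6 consecutive bases from the sequence shown in SEQ ID NO: 98; N is any one of the bases A, T, C and G; and n is any integer from 4 to 10 (preferably 6-9).
  10. Use of the library construction method according to claims 1-5, the reverse transcription primer according to claims 6-7, or the kit according to claims 8-9 in pooled mRNA 3' end library construction.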
PCT/CN2021/117183 2021-09-08 2021-09-08 High-quality 3' RNA-seq library construction method and use thereof WO2023035143A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/117183 WO2023035143A1 (zh) 2021-09-08 2021-09-08 High-quality 3' RNA-seq library construction method and use thereof

Publications (1)

Publication Number Publication Date
WO2023035143A1 true WO2023035143A1 (zh) 2023-03-16

Family

ID=85506753

Country Status (1)

Country Link
WO (1) WO2023035143A1 (zh)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21956332

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE