CN116622822A - Multiplexed mixed-sample direct RNA nanopore sequencing method

Multiplexed mixed-sample direct RNA nanopore sequencing method

Info

Publication number
CN116622822A
Authority
CN
China
Prior art keywords
sequence
sequencing
rna
current signal
signal data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310628343.1A
Other languages
Chinese (zh)
Inventor
陈路
林静雯
宋俊伟
唐超
陈钏
耿佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Application filed by Sichuan University
Publication of CN116622822A

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6874: Methods for sequencing involving nucleic acid arrays, e.g. sequencing by hybridisation
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20: Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10: Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application relates to the field of sequencing, and in particular to a multiplexed mixed-sample direct RNA nanopore sequencing method. The method sequences n test sample RNA polynucleotides (n is an integer and 2 ≤ n ≤ 24) and comprises the following steps: performing nanopore sequencing on the sequencing structures corresponding to the n test sample RNA polynucleotides and extracting the corresponding current signal data; and segmenting the current signal data of the barcode sequence and calling a classification model trained on the barcode sequences to analyze it, thereby obtaining the sequence information of the barcode sequence and, from it, the classification information of the n test sample RNA polynucleotides; wherein the barcode sequence is specific to each test sample RNA polynucleotide and is selected from any one of SEQ ID NOs 1-24. The method allows multiple biological samples to be sequenced simultaneously on one chip, reducing both the initial input of each biological sample and the cost of sequencing multiple samples.

Description

Multiplexed mixed-sample direct RNA nanopore sequencing method
Priority application
The present application claims priority to Chinese patent application CN2023102647091, entitled "A novel direct RNA nanopore sequencing method and system" and filed on March 17, 2023, which is incorporated herein by reference in its entirety.
Technical Field
The invention relates to the field of sequencing, and in particular to a multiplexed mixed-sample direct RNA nanopore sequencing method.
Background
The development of third-generation sequencing technologies has transformed long-read sequencing of genomes and transcriptomes. These technologies can sequence DNA and RNA, together with their modifications, directly, without fragmentation or amplification. With the direct RNA sequencing (DRS) technology of Oxford Nanopore Technologies (ONT), RNA molecules are sequenced directly as they pass through an array of protein nanopores in a synthetic membrane.
When a motor enzyme slowly feeds a single RNA molecule through the nanopore, the different nucleotides occupying the pore cause current fluctuations, which are recorded and decoded by a base-calling algorithm to determine the sequence of the RNA molecule. Direct RNA sequencing can produce long reads (up to 21 kb) that are intended to cover complete transcripts or nearly complete RNA genomes. This full-length coverage makes it powerful for measuring RNA poly(A) tail length and for analyzing how poly(A) tails regulate gene expression. In addition, direct RNA sequencing can directly detect modifications of RNA molecules, such as m6A, 7-methylguanosine (m7G), pseudouridine (Ψ), and 2'-O-methylation (Nm), because these modifications produce distinct current fluctuations.
Furthermore, amplification-free library preparation avoids amplification and reverse transcription (RT) bias, which is particularly useful for identifying exitrons (exonic introns) that are likely to originate from reverse transcription artifacts. These advantages make direct RNA sequencing particularly well suited to RNA structure determination, tRNA and rRNA identification, and genome sequencing of particular subtypes of RNA viruses, prokaryotes, parasites, and other organisms. Recent studies have also used direct RNA sequencing for transcriptome or epitranscriptome analysis in a wide range of species, including humans, animals (nematodes, insects), plants, zooplankton, and pathogens. Another application is the analysis of RNA metabolism kinetics, in which nascent RNA is labeled with base analogs such as 5-ethynyl uracil and 4-thiouracil and then subjected to ONT direct RNA sequencing.
While DRS has unique advantages, it has two major limitations: the lack of a method for sequencing multiple pooled samples, and the need for large amounts of RNA, typically 500 ng per sequencing run. ONT provides barcoding protocols for multiplexing cDNA libraries, which rely on ligating DNA adapters directly to the cDNA sequences. In that case, both the barcode and the cDNA sequence are easily recognized under a DNA model. In DRS, however, the adapter is a DNA oligonucleotide while the target fragment is an RNA molecule, so the DNA adapter cannot be correctly recognized under either an RNA model or a DNA model, because DNA passes through the pore about 6 times faster (450 bp/sec) than RNA (70 nt/sec). To overcome these limitations, the DeePlexiCon method was developed to pool limited RNA samples on one chip, creating multiplexed direct RNA sequencing datasets and thereby reducing the sequencing cost per sample. Owing to the limitations of its barcode design and algorithm, however, this method can multiplex only 4 samples on one chip. Although the literature reports that DeePlexiCon can be retrained as a new model, it cannot accurately demultiplex reads when different barcode sequences or more than 4 barcodes are used. Providing more barcodes to increase multiplexing capability while maintaining high accuracy therefore remains a challenge.
Disclosure of Invention
In a first aspect, the present invention provides a direct RNA nanopore sequencing method (DecodeR for short), characterized in that the method is used for sequencing n test sample RNA polynucleotides, wherein n is an integer and 2 ≤ n ≤ 24, the method comprising:
S1, providing a double-stranded DNA polynucleotide for each test sample RNA polynucleotide to capture an RNA sequence to be tested and form a first hybridization structure; the double-stranded DNA polynucleotide is a double-stranded structure assembled from the forward strand and the reverse (backward) strand of a reverse transcription adapter; the forward strand comprises, from 5' to 3', a barcode sequence and a first connecting sequence; the reverse strand comprises, from 5' to 3', a second connecting sequence, a barcode sequence, and a Poly(T) sequence; the RNA sequence to be tested comprises the test sample RNA polynucleotide and a Poly(A) sequence, and the double-stranded DNA polynucleotide captures the RNA sequence to be tested through the Poly(T) sequence; the barcode sequence is specific to each test sample RNA polynucleotide and is selected from any one of SEQ ID NOs 1-24;
S2, performing reverse transcription on the first hybridization structure to form a second hybridization structure;
S3, ligating a sequencing adapter to the second hybridization structure to form a third hybridization structure;
S4, performing nanopore sequencing on the third hybridization structure to obtain raw sequencing data;
S5, extracting the current signal data corresponding to the raw sequencing data;
S6, segmenting out the current signal data of the barcode sequence from the current signal data, and calling a classification model trained on the barcode sequences to analyze it, thereby obtaining the sequence information of the barcode sequence and, from it, the classification information of the n test sample RNA polynucleotides.
In some embodiments, the double-stranded DNA polynucleotide is generated by pre-annealing.
In some embodiments, the pre-annealing comprises mixing the forward and reverse strands of each reverse transcription adapter in a buffer and running the following program: denaturation at 95 °C for 5 min; annealing at 65 °C, 50 °C, 37 °C, and 22 °C for 30 min each; and holding at 4 °C.
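The pre-annealing program can be written down as a simple step table, which makes the intended reading explicit (30 min at each of the four annealing temperatures). This is only an illustrative sketch; the names `PRE_ANNEAL_PROGRAM` and `timed_minutes` are hypothetical and do not come from the application.

```python
# Illustrative encoding of the pre-annealing program described above.
# Each step is (temperature in deg C, duration in minutes); a duration of
# None marks the final hold. Names here are hypothetical.
PRE_ANNEAL_PROGRAM = [
    (95, 5),     # denaturation
    (65, 30),    # stepwise annealing, 30 min per temperature
    (50, 30),
    (37, 30),
    (22, 30),
    (4, None),   # hold
]


def timed_minutes(program):
    """Sum the duration of all steps that have a fixed length."""
    return sum(minutes for _, minutes in program if minutes is not None)


total = timed_minutes(PRE_ANNEAL_PROGRAM)
print(f"Timed steps: {total} min, then hold at {PRE_ANNEAL_PROGRAM[-1][0]} deg C")
```

Under this reading the timed part of the program runs for a little over two hours before the hold.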
In some embodiments, the first connecting sequence comprises TAGTAGGTTC.
In some embodiments, the second connecting sequence comprises GAGGCGAGCGGTCAATTTT.
In some embodiments, the Poly(T) sequence comprises 10-30 consecutive T residues.
In some embodiments, the Poly(A) sequence comprises 10-30 consecutive A residues.
In some embodiments, the Poly(T) sequence comprises 10 consecutive T residues.
In some embodiments, the Poly(A) sequence comprises 10 consecutive A residues.
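The strand layout of step S1 can be illustrated with a short sketch that assembles the two strands of one adapter from the connecting sequences given in the embodiments. Two things are assumptions and not taken from the application: the barcode `BARCODE` is a made-up placeholder (SEQ ID NOs 1-24 are not reproduced here), and the barcode region of the reverse strand is taken to be the reverse complement of the forward-strand barcode so that the two strands can anneal.

```python
from typing import Tuple

# Connecting sequences as stated in the embodiments.
FIRST_CONNECTING = "TAGTAGGTTC"
SECOND_CONNECTING = "GAGGCGAGCGGTCAATTTT"

# Hypothetical placeholder barcode, NOT one of SEQ ID NOs 1-24.
BARCODE = "ACGTTGCAACGTTGCAACGT"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_adapter(barcode: str, poly_t_len: int = 10) -> Tuple[str, str]:
    """Return (forward, reverse) adapter strands, both written 5'->3'."""
    # Forward strand: barcode, then first connecting sequence.
    forward = barcode + FIRST_CONNECTING
    # Reverse strand: second connecting sequence, barcode region, Poly(T).
    # Assumption: the barcode region is the reverse complement of the
    # forward-strand barcode, so the two strands form a duplex.
    reverse = SECOND_CONNECTING + reverse_complement(barcode) + "T" * poly_t_len
    return forward, reverse


forward, reverse = build_adapter(BARCODE)
print("forward 5'->3':", forward)
print("reverse 5'->3':", reverse)
```

The Poly(T) overhang at the 3' end of the reverse strand is what captures the Poly(A) sequence of the RNA to be tested.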
In some embodiments, the classification model is trained by: segmenting the current signal data of the barcode sequences in training sample data, the training sample data comprising one or more training sample RNA polynucleotides and the current signal data of the corresponding barcode sequences, each barcode sequence being selected from any one of SEQ ID NOs 1-24; and training on the segmented current signal data of the barcode sequences with a machine learning algorithm to determine the classification model.
In some embodiments, the classification model is a classification model trained based on the barcode sequences selected from any 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 of SEQ ID NOs 1-24.
In some embodiments, the training sample RNA polynucleotide comprises in vitro transcribed RNA.
In some embodiments, the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 70 to 100 columns.
In some embodiments, the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 100 columns.
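One plausible way to turn variable-length barcode signals into a fixed-width matrix is to average each signal over equal-width bins, so that every read contributes one row of 100 values. The sketch below assumes this binning scheme; the application does not state how the columns are computed, and `segment_signal` is a hypothetical name.

```python
from statistics import mean


def segment_signal(signal, n_columns=100):
    """Reduce a 1-D current trace to n_columns values by bin averaging.

    Assumption: equal-width bins with mean pooling. The source only states
    that the barcode signal is divided into a 70-100 column matrix.
    """
    n = len(signal)
    if n < n_columns:
        raise ValueError("signal is shorter than the requested number of columns")
    # Integer bin edges; every bin holds at least one sample because n >= n_columns.
    edges = [i * n // n_columns for i in range(n_columns + 1)]
    return [mean(signal[a:b]) for a, b in zip(edges, edges[1:])]


# Two hypothetical barcode traces of different lengths become equal-width rows.
traces = [list(range(1000)), list(range(250))]
matrix = [segment_signal(trace) for trace in traces]
print(len(matrix), "rows x", len(matrix[0]), "columns")
```

Fixed-width rows are what allow a standard classifier to be trained on reads whose barcode signals differ in duration.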
In some embodiments, the machine learning algorithm includes one or more of k-nearest neighbors, neural networks, naive Bayes, classification and regression trees, AdaBoost, and random forests.
In some embodiments, the machine learning algorithm comprises a random forest.
In some embodiments, the source of the test sample RNA polynucleotides comprises one or more of pathogens, plants, zooplankton, insects, mammals, and tumors.
In some embodiments, the pathogen comprises one or more of RNA viruses, bacteria, fungi, and parasites.
In some embodiments, the in vitro transcribed RNA comprises one or more of the genes shown in table 2.
In some embodiments, the classification likelihood threshold for the classification is >0.3.
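The threshold can be pictured as a filter on the classifier output: a read is assigned to its most likely barcode only if that likelihood exceeds 0.3, and is otherwise left unclassified. The sketch assumes the classifier returns one probability per barcode; the function name and the barcode labels are hypothetical.

```python
def assign_barcode(probabilities, threshold=0.3):
    """Return the most likely barcode label, or None if it is not above threshold.

    `probabilities` maps barcode labels to class probabilities (assumed to be
    the per-class output of the trained classification model).
    """
    label, likelihood = max(probabilities.items(), key=lambda item: item[1])
    return label if likelihood > threshold else None


confident = {"BC01": 0.80, "BC02": 0.15, "BC03": 0.05}
ambiguous = {"BC01": 0.25, "BC02": 0.25, "BC03": 0.25, "BC04": 0.25}
print(assign_barcode(confident))
print(assign_barcode(ambiguous))
```

With 24 barcodes a uniform guess carries a likelihood of about 0.04, so a cutoff of 0.3 sits well above chance while still retaining reads with moderate confidence.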
In some embodiments, step S5 further comprises extracting the raw current signal data corresponding to the raw sequencing data and preprocessing the raw current signal data to obtain preprocessed current signal data, wherein the preprocessing comprises one or more of noise reduction, polishing, and normalization.
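As an illustration of such preprocessing, the sketch below applies a moving-median filter (one common form of noise reduction and smoothing) followed by z-score normalization. Both choices are assumptions made for the example; the application names the preprocessing categories but not the specific filters, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev


def moving_median(signal, window=3):
    """Suppress isolated spikes by replacing each sample with a local median."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1]) for i in range(len(signal))]


def zscore(signal):
    """Rescale a trace to zero mean and unit (population) standard deviation."""
    mu = mean(signal)
    sigma = pstdev(signal)
    if sigma == 0:
        raise ValueError("constant signal cannot be normalized")
    return [(x - mu) / sigma for x in signal]


# Hypothetical raw current values (pA) with one spike at index 3.
raw = [100, 101, 99, 300, 100, 102, 98]
smoothed = moving_median(raw, window=3)
normalized = zscore(smoothed)
print(smoothed)
print([round(x, 2) for x in normalized])
```

Normalizing each read separately removes pore-to-pore offsets in absolute current, so that the classifier sees the shape of the barcode signal and not its baseline.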
In some embodiments, the method further comprises aligning the current signal data of the one or more training sample RNA polynucleotides to the corresponding reference sequences and filtering out reads that do not satisfy both of the following conditions: an alignment quality greater than 60, and alignment to a unique reference sequence.
In a second aspect, the invention provides a kit for direct RNA sequencing, wherein the kit comprises reverse transcription adapter barcode sequences as set forth in SEQ ID NOs 1-24.
In a third aspect, the present invention provides a direct RNA nanopore sequencing system, comprising:
a sequencing data receiving module for receiving and storing sequencing data, the sequencing data comprising sequencing data of a double-stranded polynucleotide; the double-stranded polynucleotide comprises a first strand and a second strand, the first strand comprising, in the 5' to 3' direction, an RNA polynucleotide, a Poly(A) sequence, a barcode sequence, a first connecting sequence, and a sequencing adapter, and the second strand comprising a sequencing adapter, a second connecting sequence, a barcode sequence, a Poly(T) sequence, and a reverse-transcribed sequence of the RNA polynucleotide; the sequencing data include test sample data and training sample data; the test sample data comprise sequencing data corresponding to n test sample RNA polynucleotides, wherein n is an integer and 2 ≤ n ≤ 24, the barcode sequence is specific to each test sample RNA polynucleotide, and the barcode sequence is selected from any one of SEQ ID NOs 1-24;
a signal processing module for processing the sequencing data into current signal data; and
a classification module for calling a classification model constructed based on the barcode sequences in the test sample data, classifying the current signal data of the test sample data, and outputting classification results for the n test sample RNA polynucleotides.
In some embodiments, the system further comprises a model training module that trains on the current signal data of the training sample data using a machine learning algorithm to determine the classification model.
In some embodiments, the signal processing module preprocesses and segments the sequencing data.
In some embodiments, the preprocessing includes one or more of noise reduction, polishing, and normalization.
In some embodiments, the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 70 to 100 columns.
In some embodiments, the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 100 columns.
In some embodiments, the machine learning algorithm includes one or more of k-nearest neighbors, neural networks, naive Bayes, classification and regression trees, AdaBoost, and random forests.
In some embodiments, the machine learning algorithm comprises a random forest.
In some embodiments, the system further comprises a filtering module for filtering out reads whose classification likelihood is below a set threshold.
In some embodiments, the training sample data comprises sequencing data of in vitro transcribed RNA.
In some embodiments, the source of the test sample RNA polynucleotides comprises one or more of pathogens, plants, zooplankton, insects, mammals, and tumors.
In some embodiments, the pathogen comprises one or more of RNA viruses, bacteria, fungi, and parasites.
In some embodiments, the threshold is >0.3.
In some embodiments, the in vitro transcribed RNA comprises one or more of the genes shown in table 2.
In some embodiments, the training sample data are divided into a training set and a test set at a ratio of 7:3.
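A 7:3 split can be sketched as a seeded shuffle followed by a cut at 70% of the reads. The helper below is a hypothetical illustration; the application does not say how reads are assigned to the two sets, and in practice the split would usually be stratified by barcode so that each barcode keeps the same ratio.

```python
import random


def split_train_test(reads, train_fraction=0.7, seed=0):
    """Shuffle reads reproducibly and split them into (train, test) lists."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical read identifiers standing in for training sample reads.
read_ids = [f"read_{i:03d}" for i in range(100)]
train, test = split_train_test(read_ids)
print(len(train), "training reads,", len(test), "test reads")
```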
In some embodiments, the model training module is configured to align the current signal data of the training sample data to the corresponding reference sequences and to filter out reads that do not satisfy both of the following conditions: an alignment quality greater than 60, and alignment to a unique reference sequence.
Advantageous effects
DeePlexiCon, the prior-art classifier for nanopore direct RNA mixed-sample sequencing, can pool limited RNA samples on one chip to generate a multiplexed direct RNA sequencing dataset, thereby reducing the sequencing cost per sample. However, DeePlexiCon can demultiplex only 4 samples on one chip, and it suffers from low multiplexing throughput and low demultiplexing accuracy; its major limitation is that a large amount of sequencing data must be discarded to achieve accurate demultiplexing, which greatly increases the sequencing cost.
The method provided by the invention (DecodeR for short) provides 24 barcodes for labeling multiple RNA samples on one chip and, by demultiplexing direct RNA-seq data with machine learning classification, multiplexes more samples on one chip with high accuracy and a low detection limit, thereby overcoming the limitations of limited RNA sample amounts and high sequencing cost. The invention segments the barcode current signal into a matrix of 70 to 100 columns, which reduces interference between currents and reduces noise. Validation showed that the recovery rate and accuracy of DecodeR can reach 100% and 92.2%, respectively. Good validation results were obtained by applying DecodeR to samples of various pathogens (RNA viruses, bacteria, fungi, and Plasmodium) and to gliomas. In addition, the invention reports some meaningful findings on the transcriptomes and mutations of different glioma samples obtained with DecodeR, which shows that the method has considerable potential for practical application. In nanopore direct RNA sequencing, the method allows multiple biological samples to be sequenced simultaneously on one chip, reduces the initial input of each biological sample, reduces the cost of sequencing multiple samples, and enables the sequencing of trace biological samples; it has been applied to samples from tumor patients and has the potential to be applied to other clinical samples.
In summary, the present invention provides a complete, mature, high-throughput, low-cost mixed-sample direct RNA sequencing method that provides more barcodes to increase multiplexing capability and enables high-accuracy sequencing and analysis of multiple biological samples in nanopore DRS, and it is expected to find practical application with clinical samples.
As used herein, "nucleic acid," "nucleic acid molecule," "polynucleotide," and "oligonucleotide" are used interchangeably to refer to covalently linked nucleotide sequences (i.e., ribonucleotides for RNA and deoxyribonucleotides for DNA) in which the 3 'position of the pentose of one nucleotide is linked to the 5' position of the pentose of the next nucleotide through a phosphodiester group. "Polynucleotide" includes single-and double-stranded polynucleotides, such as various DNA, RNA molecules, or any hybrids thereof or derivatives or combinations or fragments thereof.
As used herein, "test sample RNA polynucleotide" refers to RNA that is desired to be sequenced. The RNA may be coding and/or non-coding RNA, such as mRNA, siRNA, rRNA, miRNA, tRNA, lncRNA, snoRNA, snRNA, exRNA, piRNA.
As used herein, "complementary" refers to two nucleic acid sequences capable of forming hydrogen bonds with each other according to the base pairing principle (the Watson-Crick principle) and thereby forming a double-stranded structure. In the present invention, "complementary" includes substantially complementary and fully complementary: fully complementary means that each base in one nucleic acid sequence can pair with a base in the other nucleic acid sequence without a mismatch or gap; substantially complementary means that a majority of the bases in one nucleic acid sequence can pair with bases in the other nucleic acid sequence, so that mismatches or gaps are allowed. Typically, two nucleic acid sequences that are complementary (e.g., substantially complementary or fully complementary) will selectively/specifically hybridize or anneal and form a double-stranded structure under conditions that allow hybridization, annealing, or amplification of the nucleic acids. Accordingly, "non-complementary" (or "non-matching") means that two nucleic acid sequences cannot hybridize or anneal under conditions that allow hybridization, annealing, or amplification of the nucleic acids.
As used herein, "hybridization" and "annealing" are used interchangeably to refer to the process by which complementary single-stranded nucleic acid molecules form double-stranded nucleic acids. In general, two nucleic acid sequences that are perfectly complementary or substantially complementary may hybridize or anneal.
As used herein, "back end" (or "downstream") is used to describe the relative positional relationship of two nucleic acid sequences and has the meaning commonly understood by one of skill in the art. For example, "one nucleic acid sequence is at the rear end of another nucleic acid sequence" means that when arranged in a 5' end to 3' end orientation, the former is located further rearward (i.e., closer to the 3' end) than the latter. Accordingly, "front end" (or "upstream") has the opposite meaning of "back end".
As used herein, a "barcode" is a specific nucleotide sequence used to label the RNA that is desired to be sequenced. In the present invention, the sequence of the barcode is designed; in other words, the sequence of the barcode is known. In some embodiments, the sequence of the barcode includes one or more of SEQ ID NOs 1-24. During a single sequencing run, the barcode sequence corresponding to each RNA desired to be sequenced is unique and specific. Furthermore, the location of the barcode sequence in the sequencing structure of the invention is fixed (e.g., downstream of the poly(A)), so the barcode sequence in a read can be located by the consecutive A signals of the poly(A). The invention can train on training sample data for different barcode combinations (such as any 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 of SEQ ID NOs 1-24) with a machine learning algorithm to construct a classification model. Furthermore, during a single sequencing run (e.g., on one chip), a classification model trained on the corresponding barcode combination can be selected, according to the barcode sequences associated with the RNAs desired to be sequenced, to analyze the sequencing data and thereby classify the RNAs desired to be sequenced. In some embodiments, the methods provided herein can sequence 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, or 24 RNAs desired to be sequenced during a single sequencing run.
As used herein, "read" refers to sequencing information of any portion or all of a nucleic acid molecule obtained by a sequencing method. Reads may be stored in a storage medium and appropriately processed to determine whether they match the reference sequence or meet other criteria. A read may be obtained directly from the sequencing device or indirectly from stored sequence information about the sample.
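Locating the barcode through the poly(A) signal can be illustrated at the level of the raw trace. A homopolymer such as poly(A) tends to produce a long, nearly flat stretch of current, so one simple heuristic is to find the longest flat run and treat the samples recorded before it as the adapter and barcode side of the read (in direct RNA sequencing the 3' end enters the pore first). The sketch below is such a heuristic and not the segmentation algorithm of the application; the function name, window size, and tolerance are hypothetical.

```python
def longest_flat_run(signal, window=20, max_range=5.0):
    """Return (start, end) of the longest stretch in which every window of
    `window` samples spans at most `max_range`; (0, 0) if there is none."""
    best = (0, 0)
    run_start = None
    last_index = len(signal) - window
    for i in range(last_index + 1):
        chunk = signal[i:i + window]
        flat = max(chunk) - min(chunk) <= max_range
        if flat and run_start is None:
            run_start = i
        if run_start is not None and (not flat or i == last_index):
            last_flat = i if flat else i - 1   # start of the last flat window
            end = last_flat + window
            if end - run_start > best[1] - best[0]:
                best = (run_start, end)
            run_start = None
    return best


# Synthetic trace: fluctuating region, flat plateau standing in for poly(A),
# then a fluctuating region standing in for the transcript body.
noisy = [100.0, 60.0, 120.0, 70.0, 110.0, 65.0] * 5
trace = noisy + [90.0] * 60 + noisy
start, end = longest_flat_run(trace)
barcode_side = trace[:start]   # samples recorded before the plateau
print(start, end, len(barcode_side))
```

Real traces are noisier than this synthetic example, so the window and tolerance would need tuning on actual data.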
As used herein, "reference sequence" refers to any known sequence with which reads are aligned. In some embodiments, the reference sequence may correspond to all or only a portion of the genome or transcriptome of the organism.
As used herein, "alignment" refers to the process of comparing reads to a reference sequence, thereby determining whether the reference sequence contains the read sequence. If the reference sequence contains a read, the read may be mapped to the reference sequence or to a specific location in the reference sequence. A matched read in an alignment may be a 100% sequence match or less than a 100% match (an imperfect match).
As used herein, "pooled" refers to a sample containing a mixture of nucleic acid molecules of different genomes.
As used herein, a "library" refers to a collection of nucleic acid molecules derived from one or more nucleic acid samples. In some embodiments, the nucleic acid molecule comprises an identifiable sequence marker (e.g., a barcode). In some embodiments, two or more libraries may be pooled to create a pool of libraries.
As used herein, "multiplexing" refers to the collection of one or more nucleic acid samples, or library molecules derived therefrom, in one well, tube, or reaction (e.g., on one chip).
As used herein, "sample" refers to a sample comprising biomolecules, typically derived from a biological fluid, cell, tissue, organ, or organism. "Biological sample" refers to any sample of biological origin comprising biomolecules. In the present invention, the biomolecules comprise nucleic acids or nucleic acid mixtures, preferably RNA. In some embodiments, a "sample" or "biological sample" may be a liquid sample (e.g., body fluids, including blood, plasma, serum, urine, vaginal fluid, hydrocele fluid, vaginal rinse fluid, pleural fluid, ascites fluid, cerebrospinal fluid, saliva, sweat, tears, sputum, bronchoalveolar lavage fluid, nipple discharge, and aspirate from different parts of the body (e.g., thyroid, breast)) or a solid sample (e.g., a tissue sample, a biopsy sample, or tissue cultures or cells derived therefrom, and progeny thereof). In some embodiments, a "sample" or "biological sample" may include a sample enriched for a particular type of molecule, such as a nucleic acid. In some embodiments, a "sample" or "biological sample" may also include clinical samples, such as tissue obtained by surgical excision, tissue obtained by biopsy, cells in culture, cell supernatants, cell lysates, tissue samples, organs, bone marrow, blood, plasma, serum, and the like. In some embodiments, a "sample" or "biological sample" may also include a tumor sample, such as a sample of cells (e.g., cancer cells) from a cancer patient or a nucleic acid sample obtained from the cancer cells of a cancer patient (e.g., a cell lysate or cell extract). In some embodiments, the nucleic acid in a "sample" or "biological sample" may be free. In some embodiments, the source of the "sample" or "biological sample" may be a pathogen, such as an RNA virus, bacterium, fungus, or parasite. In some embodiments, the source of the "sample" or "biological sample" may be a plant, zooplankton, insect, or mammal (e.g., human, dog, cat, horse, goat, sheep, cow, pig, rat, mouse, etc.).
In some embodiments, the source of the "sample" or "biological sample" may also be a tumor. The "sample" or "biological sample" may be used directly after being obtained from its source, or after being pretreated (e.g., fixed, embedded in media, sectioned, washed, enriched). In some embodiments, the nucleic acids (e.g., RNA polynucleotides) in a "sample" or "biological sample" are also extracted.
As used herein, a "test sample" refers to a sample or biological sample that contains nucleic acids that are desired to be sequenced. In some embodiments, the nucleic acid is preferably an RNA polynucleotide. In some embodiments, the test sample is of biological origin. In some embodiments, the test sample may also be derived from the environment, such as soil, a lake, a river, and the like.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Like elements or portions are generally identified by like reference numerals throughout the several figures. In the drawings, elements or portions thereof are not necessarily drawn to scale. It will be apparent to those of ordinary skill in the art that the drawings in the following description show only some embodiments of the invention and that other drawings may be derived from these drawings without inventive effort.
FIG. 1 illustrates a schematic diagram of a DecodeR, an exemplary flow and algorithm comparison;
FIG. 2 shows a comparison of the DecodeR of the present invention with a prior art classifier;
FIG. 3 shows a comparison of different bar codes;
FIG. 4 shows the performance of the DecodeR of the present invention in real samples;
FIG. 5 shows the detection of RNA from a viral sample by DecodeR of the present invention;
FIG. 6 shows mutation and m6A analysis of tumor samples by DecodeR of the present invention;
FIG. 7 shows a flow chart of a direct RNA nanopore sequencing method provided by the invention;
FIG. 8 shows a schematic block diagram of a multi-barcode direct RNA nanopore sequencing classification system provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In this document, suffixes such as "module", "component", or "unit" used to denote elements are used only to facilitate the description of the present invention and have no particular meaning in themselves. Thus, "module," "component," and "unit" may be used interchangeably.
The terms "upper," "lower," "inner," "outer," "front," "rear," "one end," "the other end," and the like herein refer to an orientation or positional relationship based on that shown in the drawings, merely for convenience of description and to simplify the description, and do not denote or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
Herein, "and/or" includes any and all combinations of one or more of the associated listed items.
Herein, "plurality" means two or more, i.e., it includes two, three, four, five, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As used in this specification, the term "about" is typically expressed as +/-5% of the value, more typically +/-4% of the value, more typically +/-3% of the value, more typically +/-2% of the value, even more typically +/-1% of the value, and even more typically +/-0.5% of the value.
In this specification, certain embodiments may be disclosed in the form of a range. It should be appreciated that such a description "within a certain range" is merely for convenience and brevity and should not be construed as an inflexible limitation on the disclosed ranges. Accordingly, the description of a range should be considered to have specifically disclosed all possible sub-ranges and individual numerical values within that range. For example, the description of the range 1-6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within this range, e.g., 1, 2, 3, 4, 5, and 6. The above rule applies regardless of the breadth of the range.
Example 1
1.1 Barcode design and synthesis
As shown in Table 1, this example includes the barcode sequences of the 40 novel reverse transcription adapters (Reverse Transcription Adapter, RTA) designed according to the present invention (RTA-01 to RTA-40, 20-28 bp in length), as well as the barcode sequence of the official ONT RTA (RTA-00), the barcode sequences of 3 RTAs from DeePlexiCon (RTA-41, RTA-42 and RTA-43) and the barcode sequences of 4 RTAs from Poreplex (RTA-44, RTA-45, RTA-46 and RTA-47), for a total of 48 RTAs.
The sequencing structure on which the invention is based is shown in FIG. 1a. Each RTA consists of a forward strand and a backward (reverse) strand. In the 5' to 3' direction, the forward strand comprises a barcode sequence (X1) and a first ligation sequence (X2, which may also be referred to as a first unmatched polynucleotide sequence) attached downstream of the barcode, and the backward strand comprises a second ligation sequence (X2, which may also be referred to as a second unmatched polynucleotide sequence), the barcode sequence (X1), and a Poly(T) sequence (e.g., 10-30 consecutive T bases) attached downstream of the barcode. The forward and backward strands of the RTA are hybridized and assembled into a duplex structure. According to the base-pairing rules, the assembled RTA includes a barcode region (X1), an unmatched polynucleotide region (X2), and an overhanging Poly(T) sequence. Next, the assembled duplex captures and hybridizes with the RNA to be sequenced, which carries a Poly(A) sequence, and an RNA/DNA hybrid duplex is formed by reverse transcription. The RNA/DNA hybrid duplex is then ligated to the ONT sequencing adapter (X3), finally forming the complete sequencing structure on which the invention is based. In some embodiments, the first unmatched polynucleotide sequence includes TAGTAGGTTC (SEQ ID NO: 49) and the second unmatched polynucleotide sequence includes GAGGCGAGCGGTCAATTTT (SEQ ID NO: 50).
Next, this example uses the OligoAnalyzer™ Tool to evaluate the GC content, melting temperature and secondary structure of the DNA oligonucleotides designed in the invention. The length of the barcodes designed by the invention is kept within 20-28 nt, which avoids the problem that overly long barcodes may interfere with the removal of excess RTA in the magnetic bead purification step.
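The screening criteria described above (GC content and the 20-28 nt length window) can be illustrated with a minimal Python sketch. This is not the OligoAnalyzer tool and does not model melting temperature or secondary structure; the function names `gc_content` and `barcode_length_ok` are hypothetical, and the example input is the first unmatched polynucleotide sequence (SEQ ID NO: 49), which is a linker rather than a barcode.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def barcode_length_ok(seq: str, min_len: int = 20, max_len: int = 28) -> bool:
    """Check the 20-28 nt barcode length window stated in the text."""
    return min_len <= len(seq) <= max_len


# Example input: SEQ ID NO: 49 from the text (a 10-nt linker, not a barcode),
# so the length check is expected to fail for it.
linker = "TAGTAGGTTC"
print(gc_content(linker), barcode_length_ok(linker))
```

A real barcode from Table 1 would be passed in the same way; sequences outside the length window would be rejected before synthesis.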
TABLE 1 custom RTA sequences
1.2 preparation of In Vitro Transcribed (IVT) RNA
To test the barcodes and train the model, this example generated 52 in vitro transcribed (IVT) RNAs with sequences from 27 mouse genes, 22 human genes and 3 SARS-CoV-2 genes (Table 2). For these gene sequences, forward and reverse primers were designed using the Primer3web tool (version 4.1.0); a T7 RNA polymerase promoter sequence was added to the forward primer, and a sequence of 15 consecutive T bases (oligo(dT)15) was added to the reverse primer.
Human and mouse total RNA and SARS-CoV-2 genomic RNA were reverse transcribed into first-strand cDNA using reverse transcriptase. The different DNA fragments were amplified by PCR using the primers designed as described above, and the quality of the PCR products was checked by agarose gel electrophoresis. IVT RNA was generated by T7 RNA polymerase using 1 μg of the purified PCR product as a template, and purified using RNA Clean & Concentrator-5 according to the manufacturer's recommendations. The length of the IVT RNA was determined using Qsep100, and the concentration and purity of each IVT RNA product were measured using NanoDrop 2000.
TABLE 2 in vitro transcribed RNA sequence information
1.3 Library construction workflow for mixed-sample direct RNA sequencing
Assembly of custom RTAs. The forward strand (1.54 μM) and the backward strand (1.4 μM) of each RTA were mixed in buffer (10 mM Tris-HCl, pH 7.5, 50 mM KCl) (Table 3). The forward strand is used in excess in this example to ensure complete consumption of the backward strand, because residual free backward strand may reduce the yield of the RNA-RTA ligation product. The custom RTAs were assembled with the following PCR program: denaturation at 95 °C for 5 min, annealing at 65 °C, 50 °C, 37 °C and 22 °C for 30 min each, and holding at 4 °C (Table 4). Finally, assembly of the RTAs was confirmed by 4% agarose gel electrophoresis. The assembled products can be used as RTAs with different barcodes in the mixed-sample direct RNA sequencing scheme of the invention.
TABLE 3 Assembly of custom RTA
TABLE 4 PCR programming
Mixed-sample direct RNA sequencing library preparation. Each RNA sample carrying a Poly(A) sequence was individually hybridized and ligated to a pre-annealed custom RTA using the NEBNext Quick Ligation Module (NEB, E6056), and the one-to-one correspondence between sample and RTA was recorded. Reverse transcription was then performed using SuperScript IV reverse transcriptase (Thermo Fisher, 18090200) to form stable RNA/DNA duplexes and disrupt the secondary structure of the RNA. The product was purified using 1.8× RNAClean XP beads (Beckman, A63987). The reverse transcription products of the individual reactions were then pooled in equimolar amounts, and the RNA Adapter Mix (RMX) was ligated to the 3' end of the RNA/DNA hybrid at room temperature. After ligation, the reaction mixture was purified using 1× RNAClean XP beads (Beckman, A63987), washed twice with Wash Buffer (WSB), and the product was eluted with Elution Buffer (EB). Next, the loading mix was prepared following the manufacturer's instructions (ONT, SQK-RNA002) and adjusted to the library volume required by the MinION or PromethION sequencer. The loading mix was then loaded onto an ONT flow cell that had passed quality inspection, the sequencing parameters were set in the MinKNOW software, and sequencing was run for more than 48 hours to obtain sufficient sequencing data.
1.4 Training of the DecodeR machine learning models
First, 5 independent experiments were performed using various combinations of IVT RNA ligated to the 48 RTAs, yielding 5 DRS raw electrical signal datasets. After base calling and alignment to the reference sequences, a total of 6,276,168 valid reads were obtained (ranging from 623K to 2.30M). After quality control, 24 RTA barcodes were filtered out because they produced too few reads.
Next, the raw signal data (FAST5) of the IVT RNA were base-called with Guppy (v3.1.5) using the high-accuracy model. The base-called reads (FASTQ) were aligned to the IVT RNA reference sequences using minimap2 (v2.17-r941). The aligned reads were then filtered on the conditions that the alignment quality (MAPQ) is greater than 60 and that the read aligns to a unique reference sequence, and were grouped according to the ligated sequence to which they align. The raw signal was extracted from the FAST5 files using the R package rhdf5 (https://github.com/grimbough/rhdf5). The R package smoother (https://cran.r-project.org/package=smoother) may also be used to smooth the raw current signal and reduce noise in the raw data.
Because the Poly(A) current signal is the signal of a continuous run of A bases, this example first polishes and normalizes the current signal and determines the position of Poly(A) from the mean and variance of the signal at the 3' end of the RNA. In this way, a series of current signals is obtained for the ONT sequencing adapter (also referred to as "sequencing adapter", "RNA adapter mix", etc.), the RTA barcode, Poly(A) and the in vitro transcribed RNA sequence (FIG. 1b). Since the sequencing adapter sequence is identical in every read, the signal of the ONT sequencing adapter was not removed during analysis. The current signal of the RTA barcode preceding Poly(A) is then segmented according to mean and variance using the cpt.mean function in the changepoint package (https://CRAN.R-project.org/package=changepoint), each barcode signal being divided into 70-100 segments, preferably 100 segments (FIG. 1c).
Taking segmentation into 100 segments as an example, the 100-column matrices generated from all barcodes are ultimately used for model training. The R package caret (https://cran.r-project.org/package=caret) is used to simplify the process of building a predictive model based on random forest classification. To avoid overfitting, this example resamples using repeated K-fold cross-validation (10 folds, 10 repetitions). To improve the computational efficiency of model training, the parallel processing framework provided by the R package doParallel is adopted. Since the number of sequencing reads differs between barcodes, this example first randomly subsamples the data for each class using the downSample function in the caret package so that all classes have the same frequency as the minority class. Then, 70% of these reads are used as the training set and all remaining reads as the test set. This example trains models for combinations of 2, 4, 6, 8, 10, 12, 14, 16, 18, 20 and 24 barcodes. It should be appreciated that models for other numbers of barcodes (e.g., 3, 5, 11, 19, 22, etc.) may also be trained.
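The optional noise-reduction step can be pictured with a short Python sketch. It is only an illustration: it uses a plain centered moving average rather than the smoother R package named above, and the function name `moving_average` and the default window are assumptions, not values from the text.

```python
def moving_average(signal, window=5):
    """Smooth a current trace with a centered moving average.

    `window` is assumed to be an odd number of samples. Near the edges the
    average is taken over the part of the window that fits inside the trace.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        chunk = signal[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# A single spike is spread over its neighbours and reduced in height.
print(moving_average([1, 1, 10, 1, 1], window=3))
```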
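To make the idea of turning a barcode current trace into a fixed-length feature vector concrete, the following Python sketch normalizes a trace and summarizes it as 100 segment means. It is a simplified stand-in, not the described method: it uses z-score normalization and equal-width binning instead of changepoint detection with cpt.mean, and the names `normalize` and `segment_means` are hypothetical.

```python
from statistics import mean, pstdev


def normalize(signal):
    """Z-score normalize a current trace (zero mean, unit variance)."""
    mu = mean(signal)
    sd = pstdev(signal)
    if sd == 0:
        return [0.0 for _ in signal]
    return [(x - mu) / sd for x in signal]


def segment_means(signal, n_segments=100):
    """Split a trace into n_segments equal-width bins and return bin means.

    Assumption: equal-width bins replace the changepoint-based segmentation
    described in the text; the output length matches the 100-column matrix.
    """
    n = len(signal)
    if n < n_segments:
        raise ValueError("signal is shorter than the number of segments")
    features = []
    for k in range(n_segments):
        start = k * n // n_segments
        end = (k + 1) * n // n_segments
        features.append(mean(signal[start:end]))
    return features


trace = list(range(200))  # toy trace standing in for a barcode signal
features = segment_means(normalize(trace), n_segments=100)
print(len(features))
```

Each read then contributes one row of 100 values, so reads of different raw lengths become comparable inputs for a classifier.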
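The class balancing and 70/30 split described above can be sketched in Python as follows. This is an illustration using only the standard library, not the caret functions named in the text; `downsample_to_minority`, `split_train_test` and the fixed seeds are assumptions introduced for the example, and the split shown is a simple random split rather than a stratified one.

```python
import random
from collections import defaultdict


def downsample_to_minority(reads, seed=0):
    """Randomly keep the same number of reads per barcode label.

    `reads` is a list of (read_id, barcode_label) tuples; every label is
    reduced to the size of the smallest (minority) class.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for read_id, label in reads:
        by_label[label].append(read_id)
    n_min = min(len(ids) for ids in by_label.values())
    balanced = []
    for label in sorted(by_label):
        for read_id in rng.sample(by_label[label], n_min):
            balanced.append((read_id, label))
    return balanced


def split_train_test(reads, train_fraction=0.7, seed=0):
    """Shuffle the reads and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Toy input: 10 reads for one barcode and 4 for another (labels are made up).
reads = [(f"a{i}", "RTA-A") for i in range(10)] + [(f"b{i}", "RTA-B") for i in range(4)]
balanced = downsample_to_minority(reads)
train, test = split_train_test(balanced)
print(len(balanced), len(train), len(test))
```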
In summary, as shown in FIG. 7, the multiple mixed-sample direct RNA nanopore sequencing method provided by the present invention comprises the following steps. First, for each test sample RNA polynucleotide, the forward and backward strands of the RTA are hybridized and assembled into a double-stranded structure (i.e., a double-stranded DNA polynucleotide). The assembled double-stranded structure captures and hybridizes with the RNA sequence to be tested, which carries a Poly(A) sequence, to form a first hybridization structure (S1). The first hybridization structure is reverse transcribed to form an RNA/DNA hybrid double-stranded structure (i.e., a second hybridization structure) (S2). The RNA/DNA hybrid double-stranded structure is then ligated to the ONT sequencing adapter to form a third hybridization structure (S3). The third hybridization structure is sequenced from the 3' end with the aid of the motor protein to obtain raw sequencing data (S4), and a series of raw current signals covering the ONT sequencing adapter, the barcode, Poly(A) and the test sample RNA polynucleotide is obtained and optionally processed (e.g., extracted, noise-reduced, normalized) (S5). Then, the current signal data of the barcode are segmented, a classification model trained for the specific number of barcodes is selected to predict the current signal data, an appropriate filtering threshold (>0.3) is optionally selected, and the barcode and classification probability of each electrical signal are finally obtained (S6).
Based on the above method, the present invention also provides a multi-barcode direct RNA nanopore sequencing system 100.
An exemplary structure of the sequencing system of the present invention is described below, and in some embodiments, as shown in FIG. 8, the sequencing system 100 may include:
a sequencing data receiving module 102 for receiving and storing sequencing data comprising sequencing data of a double-stranded polynucleotide, the double-stranded polynucleotide comprising a first strand comprising, in the 5' to 3' direction, an RNA polynucleotide, a Poly(A) sequence, a barcode sequence, a first ligation sequence, and a sequencing adapter, and a second strand comprising, in the 5' to 3' direction, a sequencing adapter, a second ligation sequence, a barcode sequence, a Poly(T) sequence, and a reverse transcribed sequence of the RNA polynucleotide; the sequencing data includes test sample data and training sample data.
In some embodiments, the test sample data comprises sequencing data corresponding to n test sample RNA polynucleotides, wherein 2 ≤ n ≤ 24 and n is an integer, the barcode sequence is specific for each test sample RNA polynucleotide, and the barcode sequence is selected from any one of SEQ ID NOs: 1-24.
In some embodiments, the source of the test sample RNA polynucleotides comprises one or more of pathogens, plants, zooplankton, insects, mammals, and tumors. In some embodiments, the pathogen comprises one or more of RNA viruses, bacteria, fungi, and parasites.
A signal processing module 104 for processing the sequencing data into current signal data.
In some embodiments, the signal processing module 104 pre-processes and segments the sequencing data.
In some embodiments, the pre-treatment includes one or more of a noise reduction treatment, a polishing treatment, and a normalization treatment.
In some embodiments, the segmenting divides the current signal data of the barcode sequence into a matrix of 70-100 columns.
In some embodiments, the segmenting divides the current signal data of the barcode sequence into a matrix of 100 columns.
A classification module 106 for invoking a classification model constructed based on the barcode sequences in the test sample data, classifying the current signal data of the test sample data, and outputting classification results for the n test sample RNA polynucleotides.
In some embodiments, the sequencing system 100 further comprises a model training module 108, which trains on the current signal data of the training sample data using a machine learning algorithm to determine the classification model.
In some embodiments, the machine learning algorithm includes one or more of K nearest neighbors, neural networks, naive Bayes classification, classification and regression trees, AdaBoost, and random forests.
In some embodiments, the machine learning algorithm comprises a random forest.
In some embodiments, the training sample data comprises sequencing data of in vitro transcribed RNA.
In some embodiments, the ratio of the training sample data used as the training set to the test set is 7:3.
In some embodiments, the in vitro transcribed RNA comprises one or more of the genes shown in table 2.
In some embodiments, the model training module 108 is configured to align the reads of the training sample data to the corresponding reference sequences and to filter out reads that do not satisfy the conditions that the alignment quality is greater than 60 and that the read aligns to a unique reference sequence.
In some embodiments, the sequencing system 100 further comprises a filtering module 110 for filtering out reads whose classification probability is below a set threshold.
In some embodiments, the threshold is >0.3.
In some embodiments, the filtering module 110 and the classifying module 106 may be integrated into one analysis module 112.
1.5 Barcode identification and classification procedure of DecodeR
In some embodiments, the specific workflow of DecodeR is as follows:
selecting the trained model that matches the barcode combination used in the library construction experiment, and passing it to the model parameter;
selecting the raw electrical signal data stored and filtered by the MinKNOW software, typically located under the fast5_pass path, and passing it to the fast5 parameter;
selecting a value in the range 0-1 as the threshold for filtering the classification results, and passing it to the cutoff parameter;
selecting the number of CPUs to use, which must not exceed the maximum number of cores of the machine, and passing it to the NT parameter.
In some embodiments, the default usage of DecodeR is as follows: DecodeR(fast5, model, NT = 1, cutoff = 0, include.low = FALSE). The output records each ReadID together with the barcode to which it is assigned and the classification probability (0-1), as shown in Table 5.
TABLE 5 DecodeR output result example
ReadIDs whose classification is less reliable are filtered out based on the classification probability, and the original FAST5 signals and FASTQ sequences are then sorted according to the retained ReadIDs.
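The filtering and sorting step above can be sketched in Python. This is not the DecodeR package itself (which is an R tool); the function `filter_and_group`, the tuple layout of the predictions and the example read IDs are hypothetical, and the default cutoff of 0.3 simply reuses the threshold mentioned elsewhere in the description.

```python
from collections import defaultdict


def filter_and_group(predictions, cutoff=0.3):
    """Keep reads whose classification probability exceeds `cutoff`.

    `predictions` is an iterable of (read_id, barcode, probability) tuples,
    mirroring the three columns described for the output table. Returns a
    dict mapping each barcode to the list of retained read IDs.
    """
    groups = defaultdict(list)
    for read_id, barcode, probability in predictions:
        if probability > cutoff:
            groups[barcode].append(read_id)
    return dict(groups)


# Made-up predictions for illustration only.
predictions = [
    ("read_1", "RTA-03", 0.92),
    ("read_2", "RTA-03", 0.25),  # below the cutoff, dropped
    ("read_3", "RTA-10", 0.61),
]
print(filter_and_group(predictions))
```

The retained read IDs per barcode could then be used to select the matching FAST5 and FASTQ records for each sample.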
Example 2
2.1 Comparison of the DecodeR algorithm with other algorithms
The accuracy of 6 different classification algorithms was evaluated using a validation set: K nearest neighbor (K-Nearest Neighbour, KNN), neural network (Neural Network, NNET), naive Bayes (Naive Bayes, NB) classification, classification and regression tree (Classification And Regression Tree, CART), AdaBoost, and random forest (Random Forest, RF) (Table 6). Although all 6 methods achieved an area under the receiver operating characteristic curve (Area Under the Receiver Operating Characteristic Curve, AUROC) greater than 0.92 (FIG. 1d) and an area under the precision-recall curve (Area Under the Precision-Recall Curve, AUPRC) greater than 0.83 (FIG. 1e), the random forest algorithm outperformed the other methods, achieving the highest AUROC (0.9961) and the highest AUPRC (0.9906) among the 6 methods (FIG. 1d, FIG. 1e). In addition, the accuracy, sensitivity, specificity, precision, recall and F1 score of the random forest algorithm were significantly higher than those of the other five methods (FIG. 1f, Table 6). Therefore, the present invention builds its sequencing method (abbreviated as DecodeR) on the random forest algorithm (FIG. 1g). For further training, testing and application, performance evaluation and example analyses were performed on RNA from RNA viruses, bacteria, fungi, parasites and tumor samples, respectively (FIG. 1h).
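For reference, the evaluation metrics named above follow from the counts of true and false positives and negatives for a given barcode (treated one-versus-rest). The Python sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are illustrative and are not taken from Table 6.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard classification metrics from confusion counts.

    Sensitivity and recall are the same quantity, tp / (tp + fn).
    Ratios with an empty denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts for one barcode versus all others.
print(binary_metrics(tp=8, fp=2, tn=88, fn=2))
```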
Table 6 comparison of various algorithms
2.2 Comparison of DecodeR with other demultiplexing methods
Currently, there are two computational methods for barcode demultiplexing (demux) of direct RNA sequencing (DRS) data: DeePlexiCon (https://github.com/Psy-Fer/deeplexicon), which is based on image recognition with a residual neural network classifier, and Poreplex (https://github.com/hyeshik/poreplex), which has not been formally published.
To compare the performance of DecodeR with these methods, this example evaluates the demultiplexing accuracy and run time of each method. First, 5 independent experiments were performed on IVT RNA ligated to the 8 barcodes of DeePlexiCon and Poreplex, yielding 5 DRS raw electrical signal datasets (Table 7). Next, DecodeR was used to train a model on 3 of the datasets (training set), and the model was used to predict the other 2 datasets (validation set). Similarly, DeePlexiCon and Poreplex were used to predict barcode assignments for the same validation set. The receiver operating characteristic (Receiver Operating Characteristic, ROC) analysis, the area under the curve (Area Under Curve, AUC) values and the precision-recall curves were generated with the R package multiROC (https://cran.r-project.org/package=multiROC). CPU time was measured with the GNU time command as the sum of user time and system time.
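The CPU-time definition used above (user time plus system time) can be illustrated in Python. This sketch measures a function call inside the current process with `os.times()`, whereas the GNU time command measures a separate process; the helper name `cpu_time_of` is hypothetical.

```python
import os


def cpu_time_of(func, *args, **kwargs):
    """Run `func` and return (result, cpu_seconds).

    cpu_seconds is the user time plus the system time consumed by the
    current process while `func` was running.
    """
    before = os.times()
    result = func(*args, **kwargs)
    after = os.times()
    user = after.user - before.user
    system = after.system - before.system
    return result, user + system


total, seconds = cpu_time_of(sum, range(1_000_000))
print(total, seconds)
```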
Table 7 demultiplexing method comparison data set
Compared with Poreplex, when demultiplexing the 4 preset Poreplex barcodes, the classification accuracy of DecodeR ranged from 95.82% to 97.83% (96.98 ± 0.84), while that of Poreplex ranged from 83.4% to 93.18% (89.4 ± 4.21) (FIG. 2a, unidentified), and the accuracy, recall and F1 score of DecodeR were higher than those of Poreplex. Likewise, when demultiplexing the DeePlexiCon barcodes, the classification accuracy of DecodeR ranged from 90.57% to 94.43% (92.29 ± 1.60), whereas that of DeePlexiCon ranged from 51.33% to 85.22% (68.39 ± 15.16) (FIG. 2b, unidentified). Furthermore, DecodeR achieved significantly higher sensitivity, recall and F1 score than DeePlexiCon when demultiplexing the four barcodes (FIG. 2b, Table 8). DecodeR achieved a higher AUC (0.997 versus 0.946 for DeePlexiCon), a higher precision-recall AUC (0.993 versus 0.888), and significantly higher classification accuracy and recovery (FIGS. 2c, 2d and 2e, Table 8). Finally, to evaluate demultiplexing efficiency, the CPU time of DecodeR and DeePlexiCon was measured when processing the same numbers (1-150K) of DRS reads, and DecodeR was found to run 10 times faster than DeePlexiCon (FIG. 2f).
Table 8 Comparison of DecodeR with other demultiplexing methods
2.3 Performance of DecodeR
To evaluate the performance of DecodeR in demultiplexing direct RNA sequencing data, the sensitivity, specificity, recall, etc. of DecodeR in demultiplexing 2, 4, 6, 8, 10 and up to 24 barcodes were tested using IVT RNA. The overall demultiplexing accuracy decreased from 98.4% to 92.2% (FIG. 3c), while all AUC values remained greater than 0.99 (FIG. 3a, FIG. 3b, Table 9). Although the accuracy and recovery rate decreased as the number of barcodes increased from 2 to 24, they remained at a high level throughout (FIGS. 3c, 3d).
TABLE 9 Performance of DecodeR under different barcodes
Given the good performance of DecodeR in demultiplexing DRS data of IVT RNA, DecodeR was next applied to real RNA samples isolated from pathogens, including 2 RNA viruses: Seneca Valley virus (SVV) and porcine reproductive and respiratory syndrome virus (PRRSV); 2 bacteria: Escherichia coli (E. coli) and Salmonella enteritidis (S. enteritidis); 1 fungus: Saccharomyces cerevisiae (S. cerevisiae); and 1 parasite: Plasmodium berghei (P. berghei). Three DecodeR library construction experiments and direct RNA sequencing runs were performed on the different RNA samples, generating three batches of data. ROC analysis showed that all three replicates had high AUC values (0.999, 0.998, 0.976), with sensitivities of 99.2%, 93.3% and 90.6%, specificities of 99.4%, 98.4% respectively, and maximum accuracies of 97.6%, 95.9% and 92.5% (FIG. 4a, Table 10). DecodeR also showed high precision-recall AUC values (0.998, 0.994, 0.937) in all three batches of data (FIG. 4b). Regarding the relationship between threshold (cutoff) and accuracy, the higher the cutoff value, the higher the accuracy and the lower the recall (FIG. 4c). Sample imbalance caused the accuracy values and F1 scores of these models to differ, and severe sample imbalance was associated with low accuracy and F1 scores (Table 10).
TABLE 10 Performance of DecodeR under real samples
Example 3
3.1 Construction of RNA virus reference genomes by DecodeR
To evaluate whether DecodeR can facilitate the construction of viral genomes, the RNA virus data from the three DecodeR library construction experiments were aligned to the reference genomes and the lengths of the aligned reads were analyzed. The reads obtained with DecodeR were found to cover the reference genomes of SVV and PRRSV almost completely (FIGS. 5a, 5b), which makes assembly of RNA virus genomes quick and easy. Specifically, 724 reads (9.68%, experiment 1) and 111 reads (1.42%, experiment 2) were close to the full-length SVV genome (>7,000 nt), and 42 reads (2.26%) were close to the full-length PRRSV genome (>15,000 nt) (FIG. 5b). Next, single base mutations or single nucleotide polymorphisms (SNPs) in the SVV genome were analyzed; 414 and 423 mutations were found in experiment 1 and experiment 2, respectively, of which 411 were shared between the 2 experiments (FIG. 5e), and the mutation frequencies of these mutations were highly reproducible (FIG. 5c). For example, this example detected a C-to-T mutation at position 5700 of SVV, which was present in both experiments with mutation frequencies of 0.98 and 0.99, respectively (FIG. 5d). DecodeR can also detect high-frequency deletions: an SVV deletion (e.g., the 3-nt deletion at positions 4022-4024) was accurately identified in both experiments (FIG. 5d). Overall, DecodeR can be used to detect pathogen RNA in a cost-effective manner and potentially to identify novel or mutant RNA viruses.
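The two quantities discussed above, the mutation frequency at a site and the set of mutations shared between experiments, can be illustrated with a small Python sketch. The function names, the read counts and the tuple representation `(position, ref, alt)` are assumptions made for the example and are not data from the experiments.

```python
def mutation_frequency(alt_reads, total_reads):
    """Fraction of reads covering a site that carry the alternative base."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def shared_mutations(calls_a, calls_b):
    """Return the mutations present in both experiments.

    Each call is a (position, ref, alt) tuple.
    """
    return set(calls_a) & set(calls_b)


# Made-up calls for two experiments; (5700, "C", "T") echoes the example site.
exp1 = {(5700, "C", "T"), (1200, "G", "A"), (3100, "A", "G")}
exp2 = {(5700, "C", "T"), (1200, "G", "A"), (6400, "T", "C")}
print(sorted(shared_mutations(exp1, exp2)))
print(mutation_frequency(alt_reads=98, total_reads=100))
```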
3.2 Mutations and m6A modifications of tumor samples identified by DecodeR
To illustrate the potential use of DecodeR in cancer, 4 samples of diffuse midline glioma, H3K27M-mutant (DMG) and glioblastoma (GBM) were subjected to DecodeR mixed-sample library construction and direct RNA sequencing, and an overview of whole-transcriptome mutations, isoform identification and RNA m6A modification was obtained (FIG. 6a). In the mutation detection analysis, DecodeR detected a total of 17,587 single base mutations in the different DMG and GBM samples, and 97-98% of the mutations were verified by direct RNA sequencing (DRS) (FIG. 6b). Compared with second-generation sequencing (NGS) and Sanger sequencing, DecodeR exhibited similar accuracy in detecting tumor mutations. Among the mutation sites verified by Sanger sequencing, the mutation frequencies of the shared mutations detected by DecodeR and NGS were highly correlated (R = 0.93, Pearson correlation, FIG. 6c). For example, the chr6:29913430:C>T mutation site on the HLA-A gene showed high mutation frequencies in the three DMG samples (0.87, 0.46 and 0.84 in DMG1, DMG2 and DMG3, respectively), while the mutation frequency in the GBM sample was 0; the mutation site was also detected by NGS and Sanger sequencing, with mutation frequencies similar to those of DecodeR (FIG. 6d). In contrast, the chr16:2519933:C>A mutation site on the ATP6V0C gene had a high mutation frequency in GBM but was almost absent in the three DMG samples, which was likewise verified by NGS and Sanger sequencing (FIG. 6e). DecodeR also identified conserved deletion regions across tumor types, e.g., the chr9:12972603-12972607 locus on the SNORD137 gene, a 5-nt deletion that was present in all tumor samples (FIG. 6f).
In the m6A modification analysis, DecodeR detected a total of 2,158 high-confidence m6A sites in the different DMG and GBM samples, of which 39% were supported by both DRS and the m6A-Atlas public database, 60% were validated by DRS only, and 1% were not validated (FIG. 6g). DecodeR also showed high consistency with DRS in gene expression and m6A detection, with correlations of 0.91 and 0.92, respectively (FIG. 6h). Overall, DecodeR can be used to detect mutations and m6A RNA modifications in tumor samples; the detection results are reliable, are verified by first-generation and second-generation sequencing, and are highly consistent with direct RNA sequencing.
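The Pearson correlation used above to compare mutation frequencies between platforms can be computed as in the following Python sketch. The function name `pearson_r` and the example frequency values are illustrative and are not the measured data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Made-up mutation frequencies for the same sites on two platforms.
platform_a = [0.10, 0.45, 0.80, 0.95]
platform_b = [0.12, 0.40, 0.85, 0.90]
print(round(pearson_r(platform_a, platform_b), 3))
```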
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a computer terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to these embodiments, which are merely illustrative and not restrictive. Those of ordinary skill in the art may devise many other forms without departing from the spirit of the present invention and the scope of the claims, all of which fall within the protection of the present invention.

Claims (10)

1. A method for direct RNA nanopore sequencing for sequencing n test sample RNA polynucleotides, wherein 2 ≤ n ≤ 24 and n is an integer, the method comprising:
S1, providing a double-stranded DNA polynucleotide for each test sample RNA polynucleotide to capture an RNA sequence to be tested and form a first hybridization structure; the double-stranded DNA polynucleotide is a double-stranded structure assembled from a forward strand and a backward strand of a reverse transcription adapter, the forward strand comprises, from 5' to 3', a barcode sequence and a first connecting sequence, the backward strand comprises, from 5' to 3', a second connecting sequence, a barcode sequence and a poly(T) sequence, the RNA sequence to be tested comprises the test sample RNA polynucleotide and a poly(A) sequence, and the double-stranded DNA polynucleotide captures the RNA sequence to be tested through the poly(T) sequence; the barcode sequence is specific for each of the test sample RNA polynucleotides, the barcode sequence being selected from any one of SEQ ID NOs 1-24;
S2, carrying out reverse transcription on the first hybridization structure to form a second hybridization structure;
S3, ligating a sequencing adapter to the second hybridization structure to form a third hybridization structure;
S4, carrying out nanopore sequencing on the third hybridization structure to obtain original sequencing data;
S5, extracting current signal data corresponding to the original sequencing data;
S6, segmenting the current signal data of the barcode sequence from the current signal data, and invoking a classification model trained on the barcode sequences to analyze it, thereby obtaining the sequence information of the barcode sequence and, in turn, the classification information of the n test sample RNA polynucleotides.
2. The method of claim 1, wherein the training method of the classification model comprises: segmenting current signal data of a barcode sequence in training sample data, the training sample data comprising one or more training sample RNA polynucleotides and current signal data of a corresponding barcode sequence selected from any one of SEQ ID NOs 1-24; and training on the segmented current signal data of the barcode sequence using a machine learning algorithm to determine the classification model.
3. The method of claim 1 or 2, wherein the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 70-100 columns.
4. The method of claim 2, wherein the training sample RNA polynucleotides comprise in vitro transcribed RNA.
5. The method of claim 2, wherein the machine learning algorithm comprises one or more of K-nearest neighbors, neural networks, naive Bayes classification, regression trees, AdaBoost, and random forests.
6. The method of claim 1, wherein the source of the test sample RNA polynucleotides comprises one or more of pathogens, plants, zooplankton, insects, mammals, and tumors.
7. The method of claim 1, wherein the classification probability threshold is >0.3.
8. The method of claim 1 or 2, wherein the segmenting comprises dividing the current signal data of the barcode sequence into a matrix of 100 columns.
9. The method of claim 1, wherein S5 further comprises extracting raw current signal data corresponding to the raw sequencing data and pre-processing the raw current signal data to obtain pre-processed current signal data, the pre-processing comprising one or more of a noise reduction process, a polishing process, and a normalization process.
10. The method of claim 2, further comprising aligning the current signal data of the one or more training sample RNA polynucleotides with a corresponding reference sequence, and filtering out reads that do not satisfy the conditions that the alignment quality is greater than 60 and that the read aligns to a unique reference sequence.
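The signal-handling steps recited in the claims above (normalization in claim 9, dividing the barcode signal into a fixed number of columns in claims 3 and 8, and the >0.3 threshold in claim 7) can be illustrated by the following minimal sketch. It is not the implementation of the invention: the function names are hypothetical, median/MAD scaling is assumed as one possible normalization, each read is assumed to contribute one row whose columns are the means of contiguous signal bins, and the per-barcode probabilities are assumed to come from an already trained classifier.

```python
from statistics import median


def normalize_signal(signal):
    """Scale a raw current trace by its median and median absolute deviation.

    Assumption: median/MAD scaling is used here as one example of a
    normalization process; the source does not specify the method.
    """
    med = median(signal)
    mad = median([abs(x - med) for x in signal])
    if mad == 0:
        return [0.0 for _ in signal]
    return [(x - med) / mad for x in signal]


def segment_signal(signal, n_columns=100):
    """Reduce a variable-length barcode signal to n_columns values.

    The signal is split into n_columns contiguous bins and each bin is
    replaced by its mean, so every read yields one fixed-width matrix row.
    """
    if len(signal) < n_columns:
        raise ValueError("signal is shorter than the requested column count")
    row = []
    for i in range(n_columns):
        start = i * len(signal) // n_columns
        end = (i + 1) * len(signal) // n_columns
        chunk = signal[start:end]
        row.append(sum(chunk) / len(chunk))
    return row


def assign_barcode(probabilities, threshold=0.3):
    """Return the most probable barcode, or None if it is not above threshold.

    `probabilities` maps a barcode label to the probability reported by a
    trained classifier (hypothetical input).
    """
    best = max(probabilities, key=probabilities.get)
    return best if probabilities[best] > threshold else None


# Toy trace of 200 samples (made-up values, not real nanopore data).
trace = [float(i % 7) for i in range(200)]
row = segment_signal(normalize_signal(trace), n_columns=100)
label = assign_barcode({"BC01": 0.72, "BC02": 0.18, "BC03": 0.10})
```

Reads for which `assign_barcode` returns `None` would be left unclassified; raising the threshold trades recovered reads for fewer misassignments.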
CN202310628343.1A 2023-03-17 2023-05-30 Multiple mixed sample direct RNA nano hole sequencing method Pending CN116622822A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2023102647091 2023-03-17
CN202310264709 2023-03-17

Publications (1)

Publication Number Publication Date
CN116622822A true CN116622822A (en) 2023-08-22

Family

ID=87616816

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310628604.XA Active CN116676175B (en) 2023-03-17 2023-05-30 Multi-bar code direct RNA nanopore sequencing classifier
CN202310628343.1A Pending CN116622822A (en) 2023-03-17 2023-05-30 Multiple mixed sample direct RNA nano hole sequencing method

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202310628604.XA Active CN116676175B (en) 2023-03-17 2023-05-30 Multi-bar code direct RNA nanopore sequencing classifier

Country Status (1)

Country Link
CN (2) CN116676175B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107109489B (en) * 2014-10-17 2021-11-30 牛津纳米孔技术公司 Nanopore RNA characterization method
WO2018031588A1 (en) * 2016-08-09 2018-02-15 Takara Bio Usa, Inc. Nucleic acid adaptors with molecular identification sequences and use thereof
CN107893100A (en) * 2017-11-16 2018-04-10 序康医疗科技(苏州)有限公司 A kind of unicellular mRNA reverse transcriptions and the method for amplification
AU2019253118B2 (en) * 2018-04-13 2024-02-22 Freenome Holdings, Inc. Machine learning implementation for multi-analyte assay of biological samples
JP2023511368A (en) * 2020-01-22 2023-03-17 ゲートハウス バイオ インコーポレイテッド Small RNA disease classifier
AU2021248502A1 (en) * 2020-03-30 2022-11-03 Grail, Llc Cancer classification with synthetic spiked-in training samples
CN112176032B (en) * 2020-10-16 2021-10-26 广州市达瑞生物技术股份有限公司 Primer combination for nanopore sequencing and library building of respiratory pathogens and application thereof
CN114540472B (en) * 2021-08-27 2024-02-23 四川大学华西第二医院 Three-generation sequencing method

Also Published As

Publication number Publication date
CN116676175A (en) 2023-09-01
CN116676175B (en) 2024-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination