WO2024010875A1 - Repeat-aware profiling of cell-free RNA - Google Patents

Repeat-aware profiling of cell-free RNA

Info

Publication number
WO2024010875A1
Authority
WO
WIPO (PCT)
Prior art keywords
repeat
cfrna
aware
sequence reads
disease
Prior art date
Application number
PCT/US2023/027043
Other languages
English (en)
Inventor
Daniel Kim
Roman REGGIARDO
Original Assignee
The Regents Of The University Of California
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Regents Of The University Of California
Publication of WO2024010875A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment

Definitions

  • Cell-free RNA refers to RNA molecules present in a biological sample, e.g., a stool sample or a fluid sample such as a blood plasma or serum sample, a urine sample, a sweat sample, or a saliva sample, where the RNA has been secreted or released from cells.
  • The cell-free RNA transcriptome refers to the entire catalog of cell-free RNAs, including protein-coding messenger RNA (mRNA), RNA transcribed from repeat elements (repeat RNA), long noncoding RNA (lncRNA), transfer RNA, ribosomal RNA, pseudogene RNA, retained-intron RNA, microRNA, and other classes of RNAs.
  • Cell-free RNA is comprised of RNA molecules that may serve as biomarkers of diseases such as cancer. It can provide valuable information for detection and diagnosis of diseases, including stage and progression of the disease, as well as for monitoring response to therapy, determining disease prognosis, identifying treatment strategies, and discovering new therapeutic targets.
  • Cell-free biopsy refers to the minimally invasive collection and analysis of a biological stool or fluid sample to diagnose, monitor, and/or determine the treatment for diseases.
  • Repeats or repeat elements are patterns of sequences throughout the genome that are present in multiple copies. More than half of the human genome sequence is comprised of repeats, and the human genome contains over 5 million repeat elements.
  • The present disclosure relates generally to cell-free RNA sequencing and analysis, and more specifically, to embodiments that utilize features of repeat elements and/or gene coding or non-coding regions for downstream applications such as disease diagnosis, prognosis, and treatment.
  • a method of analyzing cell-free RNA (cfRNA) disclosed herein comprises sequencing the cfRNA from a biological sample from a subject to produce cfRNA sequence reads, aligning the cfRNA sequence reads using a repeat-aware transcriptome annotation or a repeat-aware genome annotation to obtain repeat-aware cfRNA sequence reads, wherein the repeat-aware transcriptome annotation or the repeat-aware genome annotation comprises annotated genes and annotated repeats; and quantifying the repeat-aware cfRNA sequence reads.
  • the aligning the cfRNA sequence reads comprises aligning the cfRNA sequence reads to an annotated reference, wherein a first plurality of regions of the annotated reference are annotated as corresponding to genes and a second plurality of regions are annotated as corresponding to repeat elements of a plurality of repeat subfamilies, and the quantifying the repeat-aware cfRNA sequence reads comprises, for each gene or repeat element in the plurality of repeat subfamilies, quantifying cfRNA sequence reads that align to a region annotated as corresponding to the gene or the repeat element.
  • the method of analyzing cfRNA further comprises generating repeat-aware cfRNA features at a repeat subfamily level by aggregating repeat-aware cfRNA sequence reads from a same repeat subfamily to obtain the repeat-aware cfRNA features at the repeat subfamily level.
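By way of illustration only, the quantification and subfamily-level aggregation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the `ANNOTATION` lookup, the region identifiers, the gene name `GENE_A`, and the function names `quantify` and `feature_vector` are hypothetical, and each aligned read is assumed to have already been assigned to one annotated region by an upstream aligner.

```python
from collections import Counter

# Hypothetical annotated reference: region ID -> (feature class, group).
# For a gene the group is the gene itself; for a repeat element the group
# is its repeat subfamily, so copies of one subfamily are pooled together.
ANNOTATION = {
    "GENE_A": ("gene", "GENE_A"),
    "AluSx_copy1": ("repeat", "AluSx"),
    "AluSx_copy2": ("repeat", "AluSx"),
    "L1MC2_copy1": ("repeat", "L1MC2"),
}


def quantify(aligned_regions):
    """Count reads per gene and per repeat subfamily.

    aligned_regions: iterable of region IDs, one per aligned read.
    Reads that hit a region missing from the annotation are skipped.
    """
    counts = Counter()
    for region_id in aligned_regions:
        entry = ANNOTATION.get(region_id)
        if entry is None:
            continue
        counts[entry] += 1
    return counts


def feature_vector(counts, feature_order):
    """Arrange counts in a fixed feature order; unseen features get 0."""
    return [counts.get(key, 0) for key in feature_order]
```

Pooling by subfamily rather than by individual genomic copy keeps the feature space small and sidesteps the ambiguity of assigning a read to one specific copy of a repeat.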
  • the method of analyzing cfRNA further comprises incorporating the repeat-aware cfRNA features into a machine learning model for disease classification.
  • the sequencing the cfRNA comprises: reverse transcribing the cfRNA using oligo dT primers or random hexamer primers to produce complementary DNA (cDNA); and sequencing the cDNA to produce cfRNA sequence reads.
  • the biological sample is a body fluid sample or a stool sample.
  • the method of analyzing cfRNA further comprises isolating extracellular vesicles from the biological sample from the subject; and isolating the cfRNA from the extracellular vesicles.
  • the cfRNA is sequenced using a single molecule sequencer.
  • the method of analyzing cfRNA further comprises detecting a disease in the subject, wherein the detecting the disease comprises detecting one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
  • the detecting a disease further comprises: detecting the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features at a level that is greater than a level of corresponding repeat-aware cfRNA sequence reads or repeat-aware cfRNA features in subjects that do not have the disease; or detecting the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features at a level that is at least about 2 times greater than a level of corresponding repeat-aware cfRNA sequence reads or repeat-aware cfRNA features in subjects that do not have the disease.
  • the method of analyzing cfRNA further comprises performing a differential expression analysis between the subject and the subjects that do not have the disease using the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
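As a non-limiting illustration, the criterion of a level at least about 2 times greater than in subjects without the disease can be sketched as below. The sketch assumes that levels are already normalized and that the level in disease-free subjects is summarized by their mean; both are assumptions made for illustration, and the function name `elevated_features` is hypothetical. It is not a substitute for a full differential expression analysis.

```python
def elevated_features(subject_levels, control_levels, min_fold=2.0):
    """Return features elevated at least min_fold over the control mean.

    subject_levels: dict mapping feature -> normalized level in the subject
    control_levels: dict mapping feature -> list of normalized levels in
                    subjects that do not have the disease
    Features with no controls or a zero control mean are skipped, because
    the fold change is undefined in that case.
    """
    flagged = []
    for feature, level in subject_levels.items():
        controls = control_levels.get(feature)
        if not controls:
            continue
        baseline = sum(controls) / len(controls)
        if baseline > 0 and level >= min_fold * baseline:
            flagged.append(feature)
    return sorted(flagged)
```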
  • the method of analyzing cfRNA further comprises incorporating the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features into a machine learning binary classifier for disease classification.
  • the method of analyzing cfRNA further comprises incorporating the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features into a machine learning multi-class classifier for disease classification.
  • the method of analyzing cfRNA further comprises identifying a tissue of origin of the disease in the subject, wherein the identifying the tissue of origin of the disease comprises detecting one or more tissue-specific, repeat-aware cfRNA sequence reads or repeat-aware cfRNA features for deconvolution of cfRNA transcriptome.
  • the method of analyzing cfRNA further comprises selecting a targeted treatment for the detected disease based on the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
  • the method of analyzing cfRNA further comprises monitoring an efficacy of the targeted treatment for the detected disease using the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
  • the disease is a cancer.
  • the disease is a cardiovascular disease.
  • the disease is a neurodegenerative disease.
  • the monitoring the efficacy of the targeted treatment for the detected disease comprises obtaining a biological sample of the subject and performing the methods disclosed herein.
  • a method disclosed herein comprises: for each subject of a plurality of training subjects: obtaining a biological sample from the training subject, wherein a first subset of the plurality of training subjects is labeled as having a particular type of cancer, and wherein a second subset of the plurality of training subjects is labeled as not having a particular type of cancer; extracting cfRNA from the biological sample, wherein the extracting comprises capturing polyadenylated cell-free RNA (poly-A cfRNA) in the cfRNA to obtain a poly-A cfRNA library; sequencing the poly-A cfRNA in the poly-A cfRNA library to obtain sequence reads; aligning the sequence reads to an annotated reference to obtain repeat-aware sequence reads, wherein a first plurality of regions of the annotated reference are annotated as corresponding to genes and a second plurality of regions are annotated as corresponding to repeat elements of a plurality
  • the method further comprises analyzing poly-A cfRNA in a test biological sample of a test subject to obtain test sequence reads; based on annotations of the test sequence reads in the annotated reference, generating a test feature vector that includes a first number of test sequence reads that align to genes and a second number of test sequence reads that align to each repeat subfamily of the plurality of subfamilies; loading the machine learning model into a memory of a processor; and determining a disease classification by operating on the feature vector using the machine learning model.
  • the method further comprises, based on the training, selecting a subset of the repeat subfamilies of the plurality of repeat subfamilies as features for the machine learning model; and outputting the selected features.
  • the poly-A cfRNA has at least 4 A’s at a 3’ end.
  • the sequencing is a single molecule sequencing.
  • a number of the repeat subfamilies is at least 10.
  • the method further comprises generating a poly-A enriched cfRNA cDNA library from the biological sample by reverse transcribing the poly-A cfRNA; and sequencing the poly-A enriched cfRNA cDNA library to obtain the sequence reads.
  • the method further comprises synthesizing cDNA from the poly-A cfRNA to obtain a cDNA library; and sequencing the cDNA in the cDNA library to obtain the sequence reads.
  • the analyzing the poly-A cfRNA in the test biological sample uses PCR to quantify the test sequence reads.
  • FIG. 1 is an exemplary flowchart illustrating a repeat-aware sequencing and analysis pipeline according to various embodiments.
  • FIG. 2 illustrates a distribution of cell-free RNA lengths in base pairs (bp) for GENCODE biotypes or repeat superfamily elements in pancreatic cancer patients according to certain embodiments.
  • FIG. 3A shows density plots depicting the relationship between expected and observed SINE cell-free RNA length in pancreatic cancer patients according to certain embodiments.
  • FIG. 3B illustrates a cumulative distribution function plot of SINE cell-free RNA length empirically calculated in pancreatic cancer patients according to certain embodiments.
  • FIG. 4A illustrates a comparison of mapping rates between use of a Repeat-naive reference annotation and Repeat-aware reference annotation according to certain embodiments.
  • FIG. 4B illustrates a comparison of gene detection distributions for each cohort across coding genes, long noncoding RNAs, and transposable element subfamilies according to certain embodiments.
  • FIG. 5A illustrates a comparison of age distributions between cohorts according to certain embodiments.
  • FIG. 5B illustrates the number of samples, stratified by gender, in each cohort according to certain embodiments.
  • FIG. 5C is a heatmap of Pearson correlation between patient samples using Repeat-naive quantification according to certain embodiments.
  • FIG. 5D is a heatmap of Pearson correlation between patient samples using Repeat-aware quantification according to certain embodiments.
  • FIG. 6A is a scatter plot depicting transcripts-per-million abundance for transcripts detected in matched nanopore and Illumina libraries from a patient cohort according to certain embodiments.
  • FIG. 6B is a scatter plot depicting transcripts-per-million abundance for transcripts detected in matched nanopore and Illumina libraries from a patient cohort according to certain embodiments.
  • FIG. 7 is an exemplary diagram illustrating a computational approach for the repeat-aware sequencing platform using a repeat-aware transcriptome annotation according to certain embodiments.
  • FIG. 8 is an exemplary flowchart illustrating a method for repeat-aware analysis according to various embodiments.
  • FIG. 9 shows a distribution of biotype representation in cell-free RNA-seq quantifications for samples from healthy and patient cohorts, shaded by GENCODE biotype or repeat subfamily, and facetted by stage according to certain embodiments.
  • FIG. 10A is a comparison of significantly different Shannon Entropy distributions for GENCODE biotype and repeat subfamilies according to certain embodiments.
  • FIG. 10B is a volcano plot of differentially expressed genes/repeat subfamilies derived from repeat-aware quantification according to certain embodiments.
  • FIG. 11A is an ROC curve (with confusion matrix inset) for logistic regression classifier trained on significantly differentially expressed GENCODE genes and repeat elements (Repeat-aware) according to certain embodiments.
  • FIG. 11B is a volcano plot showing differential expression in Ctrl vs. pancreas using cell-free RNA of genes and repeat elements used in Repeat-aware classifier according to certain embodiments.
  • FIG. 11C is an ROC curve (with confusion matrix inset) for logistic regression classifier trained only on significantly differentially expressed repeat elements (Repeat-alone) according to certain embodiments.
  • FIG. 11D is a volcano plot showing differential expression in Ctrl vs. pancreas using cell-free RNA of repeat elements used in Repeat-alone classifier according to certain embodiments.
  • FIG. 12 is a heatmap calculated from DESeq2 normalized counts of SINEs and simple repeats according to certain embodiments.
  • FIGS. 13A-13E are volcano plots of differentially expressed genes and repeat subfamilies derived from repeat-aware quantification of cell-free RNA-seq data, with horizontal and vertical lines drawn at -log10(0.01) and 0, respectively, according to certain embodiments.
  • FIGS. 14A-14H are statistical graphs illustrating prognostic potential of differentially expressed cell-free RNA according to certain embodiments.
  • FIG. 15A illustrates PCA dimensions 1 & 2 calculated using variance-stabilized, Repeat-naive quantifications for normal and pancreatic cancer patient samples according to certain embodiments.
  • FIG. 15B illustrates PCA dimensions 1 & 2 calculated using variance-stabilized, Repeat-aware quantifications for normal and pancreatic cancer patient samples according to certain embodiments.
  • FIG. 15C is an MA plot of log2FoldChange between pancreatic cancer patient and normal samples compared to log-scale baseMean derived from DESeq2 according to certain embodiments.
  • FIG. 16A illustrates a distribution of biotype representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for each cancer type, shaded by GENCODE biotype or repeat subfamily, and facetted by stage according to certain embodiments.
  • FIG. 17A illustrates a distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for pancreatic cancer, shaded by repeat subfamily, and facetted by stage according to certain embodiments.
  • FIG. 17B illustrates a distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for each cancer type, shaded by repeat subfamily according to certain embodiments.
  • FIGS. 18A-18T illustrate that features used in a repeat-aware sequencing platform improve performance of diagnostic models according to certain embodiments.
  • FIGS. 18A, 18E, 18I, 18M, and 18Q show Receiver Operator Characteristic (ROC) curves for the best Repeat-aware model and the equivalent Repeat-naive model. Area under the curve (AUC) estimates are displayed, with the improved Repeat-aware AUC compared to the Repeat-naive equivalent.
  • FIGS. 18B, 18F, 18J, 18N, and 18R show training sensitivity at 90% specificity for Repeat-naive and Repeat-aware models (95% C.I., binomial).
  • FIGS. 18C, 18G, 18K, 18O, and 18S show testing sensitivity calculated with the 90% specificity probability threshold identified in training (95% C.I., binomial).
  • FIGS. 18D, 18H, 18L, 18P, and 18T show a comparison of model coefficient (beta) to DESeq2 log2FoldChange for non-zero Repeat features used in the Repeat-aware model characterized in the respective row, with the total number of features displayed.
  • FIGS. 19A-19F are MA plots of log2FoldChange between labeled cancer type and healthy donor samples compared to log-scale baseMean derived from DESeq2 according to certain embodiments.
  • FIGS. 20A-20B are UpSet plots displaying the number of shared and unique up- or down-regulated TE subfamilies across the different cancer types according to certain embodiments.
  • FIG. 21A illustrates significantly differentially expressed protein-coding and long noncoding RNA genes in pancreatic cancer (panc) samples compared to healthy controls according to certain embodiments.
  • FIG. 21B illustrates significantly differentially expressed protein-coding and long noncoding RNA genes in COVID-19 (covid) samples compared to healthy controls according to certain embodiments.
  • FIG. 22A illustrates gene set enrichment analysis (GSEA) normalized enrichment scores (NES) for pancreatic cancer (panc) and COVID-19 (covid) patients based on significantly differentially expressed genes in their cell-free RNA according to certain embodiments.
  • FIG. 22B is an UpSet plot demonstrating overlap between MYC Targets V1 gene set genes in panc and covid cell-free RNA according to certain embodiments.
  • FIG. 22C is a scatterplot showing differential expression of MYC Targets V1 genes in panc and covid cell-free RNA according to certain embodiments.
  • FIG. 22D illustrates significantly differentially expressed mitochondrial (MT) genes in panc samples compared to healthy controls according to certain embodiments.
  • FIG. 22E illustrates significantly differentially expressed mitochondrial (MT) genes in covid samples compared to healthy controls according to certain embodiments.
  • FIG. 23 is an exemplary flowchart illustrating a machine learning training method according to various embodiments.
  • FIG. 24 illustrates a measurement system according to an embodiment of the present disclosure.
  • FIG. 25 shows a block diagram of an example computer system usable with system and methods according to various embodiments.
  • sample refers to any material derived from a living organism, including but not limited to humans and animals, that contains or comprises biological matter. Samples or biological samples encompass fluids, stools, and their derivatives. A majority of the nucleic acids in a sample (as obtained from the subject or later enriched) can be cell-free. In various embodiments, the majority of DNA in a biological sample (e.g., that has been enriched for cell-free DNA, such as a plasma sample obtained via a centrifugation protocol) can be cell-free, e.g., greater than 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the DNA can be cell-free.
  • At least 1,000 cell-free DNA molecules are analyzed. In other embodiments, at least 10,000 or 50,000 or 100,000 or 500,000 or 1,000,000 or 5,000,000 cell-free DNA molecules, or more, can be analyzed. At least a same number of sequence reads can be analyzed.
  • A “sequence read” refers to a string of nucleotides obtained from any part or all of a nucleic acid molecule.
  • a sequence read may be a short string of nucleotides (e.g., 20-150 nucleotides) sequenced from a nucleic acid fragment, a short string of nucleotides at one or both ends of a nucleic acid fragment, or the sequencing of the entire nucleic acid fragment that exists in the biological sample.
  • a sequence read may be obtained in a variety of ways, e.g., using sequencing techniques or using probes, e.g., in hybridization arrays or capture probes as may be used in microarrays, or amplification techniques, such as the polymerase chain reaction (PCR) or linear amplification using a single primer or isothermal amplification.
  • Example sequencing techniques include massively parallel sequencing, targeted sequencing, Sanger sequencing, sequencing by ligation, ion semiconductor sequencing, and single molecule sequencing (e.g., using a nanopore, or single-molecule real-time sequencing (e.g., from Pacific Biosciences)).
  • Such sequencing can be random sequencing or targeted sequencing (e.g., by using capture probes hybridizing to specific regions or by amplifying certain regions, both of which enrich such regions).
  • Example PCR techniques include real-time PCR and digital PCR (e.g., droplet digital PCR).
  • a statistically significant number of sequence reads can be analyzed, e.g., at least 1,000 sequence reads can be analyzed. As other examples, at least 5,000, 10,000 or 50,000 or 100,000 or 500,000 or 1,000,000 or 5,000,000 sequence reads, or more, can be analyzed.
  • the term “superfamily” may refer to a large group of repeat elements that share significant structural and functional similarities. These repeat elements are typically characterized by their repetitive nature, with multiple copies found throughout the genome. Superfamilies often encompass a diverse range of repeat sequences, exhibiting variations in length, sequence composition, and genomic distribution.
  • Repeat element superfamilies include but are not limited to short interspersed nuclear elements (SINE), long interspersed nuclear elements (LINE), long terminal repeat elements (LTR), simple repeats (micro-satellites), low complexity repeats, transposable elements (TE), retrotransposons, satellite repeats, DNA repeat elements (DNA), RNA repeats (including RNA, tRNA, rRNA, snRNA, scRNA, srpRNA), tandem repeats, other repeats (including class RC (Rolling Circle) and unknown (other unclassified repeats)), and members thereof.
  • Superfamilies comprise families.
  • the term “family” may refer to a subset of repeat elements that exhibit higher sequence similarity and evolutionary relatedness to each other compared to other members of a repeat element superfamily. Families allow for a more detailed classification and understanding of the repeat elements within a superfamily.
  • the Alu family is an exemplary family of the SINE superfamily.
  • the LINE-1 (L1) family is an exemplary family of the LINE superfamily.
  • the ERV1, ERVL, and ERVK families are exemplary families of the LTR superfamily.
  • subfamily refers to a specific group of sequences within a repeat element family that exhibit higher sequence similarity and closer evolutionary relationship. Subfamilies often arise from recent duplications, mutations, or specific genomic events that result in the emergence of distinct sequence variants or subgroups within a repeat element family.
  • Exemplary repeat element subfamilies include subfamilies of the Alu family such as AluJ subfamilies (e.g., AluJb, AluJo, AluJr), AluS subfamilies (e.g., AluSx1, AluSx, AluSz, AluSg, AluSx3, AluSz6, AluSc, AluSc8, AluSq, AluSx4, AluSg7, AluSc5, AluSg4), and AluY subfamilies (e.g., AluYa5, AluYb8, AluYm1, AluYj4, AluYc, AluYb3, AluYe5), subfamilies of the L1 family (e.g., L1MD2, L1MC2, L1MEd), and subfamilies of the ERV1 family (e.g., HERVH-int), subfamilies of the ERVK family (e.g., HERVK-m
  • the term “repeat-aware” refers to an approach, methodology, or technique that takes into account the presence and characteristics of repeat elements when analyzing genetic data (e.g., RNA sequencing data).
  • the repeat-aware sequencing and analysis pipeline disclosed herein can identify, annotate, and account for the repetitive regions within RNA sequences, enabling a more comprehensive understanding of their structure, function, and implications in downstream applications such as disease diagnosis, prognosis, and treatment.
  • the term “repeat-aware” may be used interchangeably with “repeat-derived.”
  • repeat-naive refers to an analytical approach or methodology that does not take into account the presence or characteristics of repeat elements when analyzing genetic data. Repeat-naive analysis disregards the repetitive regions within a sequence and focuses solely on the unique or non-repetitive portions.
  • transcriptome annotation refers to the process of identifying and characterizing the various components and features within a transcriptome.
  • Transcriptome annotation can assign functional annotations and biological context to the identified RNA molecules, enabling a comprehensive understanding of transcriptomic features such as gene expression, non-coding RNAs, and others.
  • the term “repeat-aware transcriptome annotation” refers to the process of identifying and characterizing the various components, including repeat elements, and features within a transcriptome.
  • genomic annotation refers to the process of identifying and characterizing the various components and features within a genome.
  • the term “repeat-aware genome annotation” refers to the process of identifying and characterizing the various components, including repeat elements, and features within a genome.
  • the term “parameter” as used herein means a numerical value that characterizes a quantitative data set and/or a numerical relationship between quantitative data sets. For example, a ratio (or function of a ratio) between a first amount of a first nucleic acid sequence and a second amount of a second nucleic acid sequence is a parameter. The parameter can be used to determine any classification described herein.
  • “cutoff” and “threshold” refer to predetermined numbers used in an operation.
  • a cutoff size can refer to a size above which fragments are excluded.
  • a threshold value may be a value above or below which a particular classification applies. Either of these terms can be used in either of these contexts.
  • a cutoff or threshold may be “a reference value” or derived from a reference value that is representative of a particular classification or discriminates between two or more classifications.
  • a cutoff may be predetermined with or without reference to the characteristics of the sample or the subject. For example, cutoffs may be chosen based on the age or sex of the tested subject. A cutoff may be chosen after and based on output of the test data.
  • certain cutoffs may be used when the sequencing of a sample reaches a certain depth.
  • reference subjects with known classifications of one or more conditions and measured characteristic values (e.g., a methylation level, a statistical size value, or a count)
  • a reference value can be determined based on statistical simulations of samples. Any of these terms can be used in any of these contexts.
  • a reference value can be determined in various ways, as will be appreciated by the skilled person. For example, metrics can be determined for two different cohorts of subjects with different known classifications, and a reference value can be selected as representative of one classification (e.g., a mean) or a value that is between two clusters of the metrics (e.g., chosen to obtain a desired sensitivity and specificity). As another example, a reference value can be determined based on statistical simulations of samples. A particular value for a cutoff, threshold, reference, etc. can be determined based on a desired accuracy (e.g., a sensitivity and specificity).
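As one illustrative way of choosing a cutoff to obtain a desired specificity, the sketch below picks the smallest score cutoff at which a target fraction of samples from a known disease-free cohort are called negative, then reports sensitivity in a known disease cohort at that cutoff. The function names, the convention that a score above the cutoff is called positive, and the example scores are assumptions made for illustration; both cohorts are assumed to be non-empty.

```python
import math


def threshold_at_specificity(control_scores, target_specificity=0.90):
    """Smallest cutoff at which at least the target fraction of control
    scores fall at or below the cutoff (and are therefore called negative).
    """
    ordered = sorted(control_scores)
    # Number of controls that must be called negative. The small tolerance
    # guards against floating-point round-up in the product.
    needed = max(1, math.ceil(target_specificity * len(ordered) - 1e-9))
    return ordered[needed - 1]


def specificity(control_scores, cutoff):
    """Fraction of controls correctly called negative (score <= cutoff)."""
    return sum(1 for s in control_scores if s <= cutoff) / len(control_scores)


def sensitivity(case_scores, cutoff):
    """Fraction of cases correctly called positive (score > cutoff)."""
    return sum(1 for s in case_scores if s > cutoff) / len(case_scores)
```

Fixing the cutoff on one cohort and then applying it unchanged to new samples mirrors the separation between choosing a reference value and using it for classification.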
  • a “level of disease” can refer to the amount, degree, or severity of disease associated with an organism.
  • An example of a disease is cancer.
  • Other example diseases include a cardiovascular disease and a neurodegenerative disease.
  • a “machine learning model” can refer to a software module configured to be run on one or more processors to provide a classification or numerical value of a property of one or more samples.
  • An ML model can be generated using sample data (e.g., training data) to make predictions on test data.
  • One example is an unsupervised learning model.
  • Another example type of model is supervised learning that can be used with embodiments of the present disclosure.
  • Example supervised learning models may include different approaches and algorithms including analytical learning, statistical models, artificial neural network, backpropagation, boosting (meta-algorithm), Bayesian statistics, case-based reasoning, decision tree learning, inductive logic programming, Gaussian process regression, genetic programming, group method of data handling, kernel estimators, learning automata, learning classifier systems, minimum message length (decision trees, decision graphs, etc.), multilinear subspace learning, naive Bayes classifier, maximum entropy classifier, conditional random field, nearest neighbor algorithm, probably approximately correct (PAC) learning, ripple down rules, a knowledge acquisition methodology, symbolic machine learning algorithms, subsymbolic machine learning algorithms, minimum complexity machines (MCM), random forests, ensembles of classifiers, ordinal classification, data pre-processing, handling imbalanced datasets, statistical relational learning, or Proaftn, a multicriteria classification algorithm.
  • the model may include linear regression, logistic regression, deep recurrent neural network (e.g., long short-term memory, LSTM), hidden Markov model (HMM), linear discriminant analysis (LDA), k-means clustering, density-based spatial clustering of applications with noise (DBSCAN), random forest algorithm, support vector machine (SVM), or any model described herein.
  • Supervised learning models can be trained in various ways using various cost/loss functions that define the error from the known label (e.g., least squares and absolute difference from known classification) and various optimization techniques, e.g., using backpropagation, steepest descent, conjugate gradient, and Newton and quasi-Newton techniques.
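By way of example only, training a supervised model by steepest descent on a loss measured against known labels can be sketched with a small logistic regression classifier. The choice of the logistic (cross-entropy) loss, the learning rate, the epoch count, and the function names are assumptions made for illustration; a practical model would typically add feature scaling and regularization.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for feature vector x."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by full-batch steepest descent.

    X: list of feature vectors (e.g., scaled gene and repeat-subfamily levels)
    y: list of known labels, 1 for disease and 0 for no disease
    """
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Gradient of the cross-entropy loss for one sample.
            error = predict_proba(weights, bias, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        # Step against the mean gradient.
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias
```

The fitted coefficients play the role of the per-feature model coefficients (beta) that can be compared against fold changes for the same features.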
  • RNA ribonucleic acid
  • lncRNAs long noncoding RNAs
  • TEs transposable elements
  • cfRNA cell-free RNA
  • the diagnostic and prognostic potential of cell-free RNA is evidenced by the prediction of preeclampsia in pregnancy, and cell-free RNAs serve as biomarkers of diseases such as cancer and Alzheimer’s disease.
  • RNA liquid biopsies enable systemic profiling and analyzing of cell-free RNAs secreted by cells throughout the body.
  • Well-annotated coding and noncoding RNAs are readily detectable in the blood and can serve as disease-related biomarkers.
  • the full composition and clinical utility of the cell-free RNA transcriptome is unclear.
  • RNA-seq whole exome RNA sequencing
  • In RNA analysis, a distinct advantage emerges because RNA is continuously secreted without relying on cell death or material release. These RNA molecules are consistently released through extracellular vesicles, and their signal exhibits changes associated with oncogenic signaling.
  • the existing RNA analysis methods often lean heavily on the gene coding regions and ignore repeat sequences, which serve as one of the earliest indicators of oncogenic transformation in cells that develop into cancer.
  • cfRNAs as disease biomarkers for clinical applications
  • techniques disclosed herein relate to methods and systems for repeat-aware analysis of the cfRNA transcriptome using a repeat-aware sequencing and analysis platform. Such analysis can enable the detection and quantification of the entire cfRNA transcriptome (including mRNAs, repeat RNAs, lncRNAs, etc.).
  • the disclosed techniques can enable in-depth characterization of disease-specific, repeat-derived cell-free RNAs, as well as the accurate classification of patients by leveraging the feature space of the repeat-derived cell-free RNA transcriptome.
  • the disclosed techniques can identify a robust and dynamic repeat RNA signature for the diagnosis of diseases such as cancer.
  • the disclosed techniques can provide improved performance in classifying cancer (e.g., liver, lung, esophageal, colorectal, and stomach cancer) cell-free RNA data compared with conventional repeat-naive classifiers.
  • the disclosed techniques successfully reduce the number of RNA features from over 5 million to about 15,000 repeat-derived features, which increases the accuracy of machine learning classification and analysis and reduces the computational complexity of the repeat-aware sequencing and analysis platform.
  • increased signal during the early stages of cancer can be obtained and valuable insights can be provided into prognosis, treatment strategies, and monitoring diagnostic tool.
  • the disclosed techniques comprise sequencing cfRNA from a biological sample from a subject to produce cfRNA sequence reads and aligning and quantifying the cfRNA sequence reads using a repeat-aware transcriptome annotation or a repeat-aware genome annotation. The techniques may further comprise performing feature reduction of repeat-derived cfRNAs and developing machine learning (ML) models that incorporate these repeat-aware cfRNA features (including mRNAs, repeat RNAs, lncRNAs, etc.) for classification of diseases such as cancer.
  • cfRNA is extracted and isolated from biofluids (e.g., plasma from blood samples), and the repeat-aware transcriptome annotation comprises a known RNA annotation (e.g., mRNAs, lncRNAs, etc.) and/or annotated repeat elements in the human genome.
  • the annotation may comprise over 5 million annotated repeat elements.
  • the aligning and quantifying cfRNA is performed in a repeat-aware manner for subsequent feature reduction and machine learning classification using these repeat-aware features for disease diagnosis.
  • Techniques disclosed herein comprise a repeat-aware sequencing and analysis pipeline that enables repeat-aware characterization of the cell-free RNA transcriptome.
  • the pipeline can be incorporated into a repeat-aware sequencing platform (e.g., a COMPLETE-Seq platform).
  • a tractable feature set can be aggregated into the repeat-aware sequencing and analysis pipeline to enable diagnostic modeling with efficacy.
  • FIG. 1 is an exemplary flowchart illustrating a repeat-aware sequencing and analysis pipeline 100. Some or all of the steps of analysis pipeline 100 can be performed by a computer system, potentially in coordination with robotics for handling a sample.
  • a biological sample is obtained.
  • the obtaining of the biological sample may be performed according to various techniques such as those disclosed in Section II(A).
  • a sample can be obtained from storage, as can be done when another entity obtains the sample from a subject.
  • the biological sample includes cell-free RNA.
  • the biological sample can be a stool sample or a fluid sample, such as blood, serum, plasma, urine, or sweat.
  • multiple biological samples are obtained. The multiple biological samples may be obtained from the same subject or from multiple subjects.
  • cell-free RNA is extracted from the obtained samples.
  • the cfRNA may be extracted according to various techniques such as those disclosed in Section II(B).
  • the cfRNA is reverse transcribed to produce complementary DNA (cDNA).
  • the reverse transcription may be performed using primers such as oligo dT primers or random hexamer primers.
  • the cfRNA and/or the cDNA is then sequenced to obtain sequencing data, e.g., cfRNA sequence reads according to various techniques such as those disclosed in Section II.
  • An RNA-seq protocol may be administered at 130 to provide robust detection of both coding and noncoding RNAs.
  • cDNA is sequenced to produce cfRNA sequence reads.
  • the sequencing may be performed using any suitable technique, e.g., single molecule sequencing techniques (e.g., nanopore or SMRT sequencing from Pacific Biosciences that can sequence the entire molecule) or next-generation sequencing techniques, which may sequence one end or both ends of the molecule.
  • Informative biomarkers regarding disease diagnosis may be identified by the repeat-aware sequencing and analysis platform using the cDNA sequencing data.
  • the sequencing data (e.g., cfRNA sequence reads) are then input into a repeat-aware analysis system to perform repeat-aware analysis, according to various techniques such as those disclosed in Sections III and IV.
  • the repeat-aware analysis may comprise aligning and quantifying the cfRNA sequence reads and aggregating repeat-aware cfRNA sequence reads to obtain repeat-aware cfRNA features in order to reduce the total feature number.
  • the repeat-aware analysis enables in-depth characterization of disease-specific, repeat-derived cell-free RNAs, e.g., by using a customized, repeat-aware transcriptome annotation or a customized, repeat-aware genome annotation.
  • the repeat-aware analysis can thus focus on repeat-derived RNAs while also taking other coding and noncoding RNAs into consideration.
  • a customized transcriptome/genome annotation provided by the repeat-aware analysis system at 140 can comprise at least hundreds, thousands, or millions of repeat element insertions found in the human genome.
  • the repeat element insertions can be found in introns and/or exons.
  • the customized transcriptome/genome annotation may incorporate both well-annotated coding and noncoding RNAs (e.g., GENCODE, UCSC Genes, NCBI RefSeq, Ensembl Genes, HGNC), as well as any number of the over 5 million repeat element insertions found in the human genome (e.g., RepeatMasker).
  • an annotated reference sequence (e.g., an annotated reference genome) is provided by the repeat-aware analysis system at 140 for alignment and quantification of the cfRNA sequence reads.
  • the annotated reference sequence may comprise multiple regions corresponding to repeat families or repeat subfamilies.
  • Exemplary repeat families include: short interspersed nuclear elements (SINE), which include Alus; long terminal repeat (LTR) elements; DNA repeat elements; and RNA repeats, including RNA, tRNA, rRNA, snRNA, scRNA, and srpRNA.
  • Exemplary repeat subfamilies include Alu members of AluJ (e.g., AluJb, AluJo, AluJr), AluS (e.g., AluSx1, AluSx, AluSz, AluSg, AluSx3, AluSz6, AluSc, AluSc8, AluSq, AluSx4, AluSg7, AluSc5, AluSg4), and AluY (e.g., AluYa5, AluYb8, AluYm1, AluYj4, AluYc, AluYb3, AluYe5).
  • the resulting reads obtained from sequencing may potentially map to multiple repeat insertions within the human genome. This introduces a degree of ambiguity when determining the precise location of a read alignment. It becomes challenging to ascertain if a read uniquely maps to a specific repeat within a particular region of a chromosome, especially considering the short read length and the repetitive characteristics of the reads. Consequently, in some embodiments, multiple annotations can be determined for a cfRNA sequence read to utilize all available repeat sequences for alignment purposes and quantify the reads based on their mapping to these repeats.
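  • As a minimal sketch of quantifying reads that map to several repeat insertions, the example below splits each read's unit count equally across every insertion it aligns to. The equal split, the function name `fractional_counts`, and the insertion identifiers are illustrative assumptions, not the disclosed method; quantification tools such as Salmon estimate these shares probabilistically instead.

```python
from collections import defaultdict


def fractional_counts(read_to_repeats):
    """Split each read's unit count equally across all repeats it maps to.

    read_to_repeats: dict mapping a read ID to the list of repeat insertion
    IDs the read aligns to (hypothetical input structure).
    """
    counts = defaultdict(float)
    for read_id, repeats in read_to_repeats.items():
        if not repeats:
            continue  # an unmapped read contributes nothing
        share = 1.0 / len(repeats)
        for repeat_id in repeats:
            counts[repeat_id] += share
    return dict(counts)


# Hypothetical alignments: read1 is ambiguous between two AluY insertions.
alignments = {
    "read1": ["AluY_chr1_100", "AluY_chr7_500"],
    "read2": ["AluY_chr1_100"],
    "read3": [],
}
counts = fractional_counts(alignments)
```

  • Because every mapped read still contributes a total weight of one, the summed counts equal the number of mapped reads.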
  • a quantification (e.g., a count) can be determined based on the alignment for each repeat element insertion and/or gene coding or noncoding region. Other measures, including median of ratios (e.g., DESeq2), trimmed mean of M values (TMM) (e.g., edgeR), counts per million (CPM), transcripts per million (TPM), fragments per kilobase of exon model per million mapped reads (FPKM), and reads per kilobase of exon model per million reads (RPKM), may also be used for the quantification. In some embodiments, the quantification may be aggregated by families.
  • the quantification or the aligned cfRNA sequence reads may also be aggregated from individual repeat element insertions to a subfamily element level.
  • the aggregation technique can reduce the number of features to be used downstream. The reduction can be by at least 10x, 20x, 50x, 75x, 100x, 120x, 150x, or more.
  • the number of repeat features can be reduced from over 5 million to approximately 15,000 repeat features for disease classification and other downstream analyses.
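  • The aggregation from individual repeat element insertions to the subfamily level can be sketched as a grouped sum. The function name, the insertion identifiers, and the insertion-to-subfamily lookup below are hypothetical; in practice the lookup would come from the repeat annotation.

```python
from collections import defaultdict


def aggregate_by_subfamily(insertion_counts, insertion_to_subfamily):
    """Sum per-insertion quantifications into one total per repeat subfamily."""
    totals = defaultdict(float)
    for insertion_id, count in insertion_counts.items():
        subfamily = insertion_to_subfamily[insertion_id]
        totals[subfamily] += count
    return dict(totals)


# Hypothetical per-insertion counts and their subfamily assignments.
insertion_counts = {
    "AluY_chr1_100": 12.0,
    "AluY_chr7_500": 3.0,
    "L1HS_chr2_42": 5.0,
}
insertion_to_subfamily = {
    "AluY_chr1_100": "AluY",
    "AluY_chr7_500": "AluY",
    "L1HS_chr2_42": "L1HS",
}
subfamily_counts = aggregate_by_subfamily(insertion_counts, insertion_to_subfamily)
```

  • Here three insertion-level features collapse into two subfamily-level features; the same grouping applied genome-wide is what shrinks the feature space.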
  • the reduced features are the repeat-aware or repeat-derived features.
  • the quantification or aggregated quantification may be further normalized based on various methods.
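  • Two of the normalization measures named above, counts per million (CPM) and transcripts per million (TPM), can be sketched as follows. The feature names, counts, and lengths are made-up illustrative values, and real pipelines typically use effective rather than raw feature lengths for TPM.

```python
def cpm(counts):
    """Counts per million: scale each count by the library size."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("library has no counts")
    return {feature: count / total * 1e6 for feature, count in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rates = {feature: counts[feature] / lengths_bp[feature] for feature in counts}
    total_rate = sum(rates.values())
    if total_rate <= 0:
        raise ValueError("library has no counts")
    return {feature: rate / total_rate * 1e6 for feature, rate in rates.items()}


# Hypothetical counts and feature lengths (bp) for one sample.
counts = {"geneA": 100, "AluY": 300}
lengths_bp = {"geneA": 1000, "AluY": 300}
cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_bp)
```

  • CPM corrects only for sequencing depth, while TPM also corrects for feature length, so a short repeat with many reads gains relative weight under TPM.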
  • the repeat-aware cfRNA features are used in downstream applications.
  • the repeat-aware cfRNA features can be informative to disease (e.g., cancer) diagnosis.
  • the repeat-aware cfRNA features are input into a trained machine learning model to provide a classification (e.g., diagnosis or prognosis) of a disease.
  • the repeat-aware cfRNA features may also be used for determination of a targeted treatment or development of a new treatment method.
  • the cfRNA sequence reads/sequencing data may also be used together with the repeat-aware cfRNA features in the downstream applications.
  • FIG. 2 shows histograms that illustrate distributions of cell-free RNA lengths in base pairs (bp) for GENCODE biotypes and repeat superfamily elements in pancreatic cancer patients (panc 6, panc 7). FIG. 2 shows that the size distributions of protein-coding RNA, lncRNA, long interspersed elements (LINE), SINE, and LTR-derived cell-free RNAs from pancreatic cancer patients vary by biotype and repeat superfamily, which supports the use of repeat-aware cfRNA features as biomarkers for disease diagnosis and other downstream applications.
  • the size of cell-free RNA transcripts is up to 1,337 bp in length for protein-coding RNA (median: 456 bp), 970 bp for lncRNA (median: 303 bp), 1,002 bp for LINE (median: 258 bp), 368 bp for SINE (median: 185 bp), and 477 bp for LTR (median: 167 bp).
  • the size may be used as biomarkers for diagnosis.
  • RNA lengths may be used as biomarkers. For example, as illustrated in 210 and 220 of FIG. 2, a bimodal length distribution is shown for the SINE-derived cell-free RNA data, reflecting data from two subfamilies (a full-length, about 300-bp long Alu-derived RNA (220), along with a shorter species of Alu-derived RNA (210)).
  • FIG. 3A further shows exemplary density plots depicting the relationship between expected (genomic SINE locus length) and observed SINE cell-free RNA length in pancreatic cancer patients.
  • FIG. 3B is a cumulative distribution function plot of SINE cell-free RNA length empirically calculated in pancreatic cancer patients.
  • the repeat-aware sequencing and analysis pipeline can also identify enriched gene sets or signaling pathways in the cell-free RNA of patients, especially cancer patients. By analyzing the cell-free RNA transcriptome, molecular signatures specific to the disease can be determined that may provide valuable insights into potential druggable targets or signaling pathways that could be targeted using existing treatments.
  • the efficacy of the administered drug or targeted therapy can be assessed.
  • This approach makes it possible not only to identify potential therapeutic targets but also to evaluate the treatment response based on the dynamic alterations in the cell-free RNA profile.
  • the analysis also contributes to personalized and targeted approaches for cancer treatment and monitoring the effectiveness of therapeutic interventions.
  • the repeat-aware sequencing and analysis pipeline and corresponding platform can enhance the analysis of cell-free RNA data over conventional (repeat-naive) techniques that use only well-annotated GENCODE coding and noncoding genes for cell-free RNA quantification.
  • using the repeat-aware sequencing platform can significantly increase the percentage of mapped reads in pancreatic cancer (panc) patient cfRNA data while keeping the mapping rate for normal subjects comparable to that of a repeat-naive technique.
  • FIG. 4B shows boxplots that illustrate a comparison of gene detection distributions for each cohort (normal vs. pancreatic cancer) across coding genes, long noncoding RNAs, and transposable element (TE) subfamilies.
  • B. Increased Sample-to-Sample Correlation Detection of Patients’ cfRNA Data
  • FIGS. 5A-5D illustrate the sample-to-sample correlation of patients’ cfRNA data obtained with the repeat-aware pipeline compared to conventional techniques that use well-annotated GENCODE coding and noncoding genes, using pancreatic cancer samples and normal samples.
  • FIGS. 5A and 5B illustrate the age and gender distribution of the two cohorts. It has been proved that age and gender do not affect the performance of the repeat-aware sequencing and analysis pipeline.
  • FIGS. 5C and 5D are heatmaps of Pearson correlation between pancreatic cancer samples using repeat-naive quantification and repeat-aware quantification respectively.
  • the heatmaps show that the repeat-aware sequencing platform increases the sample-to-sample correlation of pancreatic cancer patient cell-free RNA data, indicating greater overall similarity when examining a more robust annotation of their cell-free RNA transcriptomes.
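  • A sample-to-sample Pearson correlation matrix of the kind visualized in such heatmaps can be sketched as below. The sample names and feature values are hypothetical, and the source does not specify which transformation of the quantifications was correlated.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # Correlation is undefined when either vector is constant.
    return cov / denom if denom > 0 else float("nan")


def correlation_matrix(samples):
    """Pairwise Pearson correlation between every pair of samples."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


# Hypothetical normalized feature vectors (same feature order in each sample).
samples = {
    "panc_1": [1.0, 2.0, 3.0, 4.0],
    "panc_2": [2.0, 4.0, 6.0, 8.0],
    "normal_1": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(samples)
```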
  • FIGS. 6A and 6B are scatter plots depicting transcripts-per-million abundance for transcripts detected in matched nanopore and Illumina libraries from two cohorts (panc 7 and panc 6) respectively.
  • the disclosed techniques demonstrate strong concordance between the well-annotated coding and noncoding genes, with a bias towards Illumina for TE RNAs and towards nanopore for some simple repeat RNAs.
  • Techniques disclosed herein may be wholly or in part integrated into the repeat-aware sequencing and analysis pipeline.
  • Cell-free biopsy (e.g., liquid or stool) is a non-invasive approach in diagnostics and cancer research, providing a means to detect and monitor various diseases, including cancer.
  • cell-free biopsy analyzes circulating biomarkers found in body fluids and feces, such as blood, urine, saliva, and stool. These biomarkers include DNA, RNA, proteins, and other molecules released by tumor cells into the bloodstream.
  • A component of cell-free biopsy is RNA liquid biopsy.
  • RNA liquid biopsy techniques enable the isolation and analysis of cell-free RNA fragments (RNA molecules freely circulating in different body fluids and stools), such as microRNAs, mRNAs, and TE RNAs, allowing the association of specific RNA features with different disease states. This non-invasive approach can be used for early disease detection, prognostic assessments, and personalized medicine.
  • the study of cfRNA opens new avenues for understanding cellular interactions and environments, revolutionizing diagnostics and therapeutics in fields ranging from cancer research to prenatal testing.
  • Body fluid samples are various fluids obtained from the human body, including blood, urine, saliva, cerebrospinal fluid, and the like. Each fluid type carries unique information about the body’s biological processes, making them valuable sources for biomarker discovery, disease detection, and monitoring. Analyzing these biological samples offers non-invasive access to discover disease mechanisms, identify biomarkers, and develop diagnostic and therapeutic strategies for improved patient care.
  • Biological fluid and/or stool samples may be collected from both healthy controls and patients.
  • Various biofluids and sample collection kits are available to obtain body fluid samples, including cell-free RNA blood collection tubes (e.g., K2-EDTA tubes) for stabilizing and preserving RNA in blood, as well as urine and saliva collection kits designed for reliable and convenient collection, ensuring RNA preservation for downstream analysis.
  • the obtaining the biological samples may comprise isolating extracellular vesicles from the sample from the subject.
  • a sample can be centrifuged to separate the cellular components from the cell-free fraction. Centrifugation conditions may vary depending on the specific sample type and the desired separation.
  • the extraction of cfRNA can be performed using a commercial extraction kit (e.g., any Norgen Biotek kits for Plasma/Serum RNA, cell-free RNA, circulating RNA, and/or exosomal RNA purification or any Qiagen kits for biofluid RNA and/or exosomal RNA purification).
  • Such kits provide reagents and protocols for efficient RNA extraction, such as silica-based columns or magnetic beads.
  • the extraction process can involve one or more of filtering, lysis of the cells, binding of the RNA to the extraction matrix, washing to remove contaminants, and elution of purified cfRNA.
  • the cfRNA is extracted or isolated from the extracellular vesicles.
  • the ExoRNeasy kit (Qiagen) can be used to isolate cfRNA from blood plasma.
  • the blood plasma samples are obtained from deidentified healthy controls and patients (e.g., pancreatic cancer patients). Samples may be initially filtered through a filter (e.g., a 0.8 µm filter) to remove any contaminants such as buffy coat. Filtered plasma can then be processed using the ExoRNeasy kit to isolate cell-free RNA according to manufacturer instructions.
  • a specific type of cfRNA may be targeted and captured from the sample or the cfRNA transcriptome.
  • a specific type of cfRNA is Polyadenylated RNA (poly-A RNA).
  • Poly-A RNA refers to the RNA molecules that possess a long tail consisting of adenine nucleotides (A's) at their 3' end.
  • the utilization of poly-A RNA provides several advantages in sequencing and transcriptomic analysis. When constructing a sequencing library, a poly-A-specific approach can be employed, where an oligo binds specifically to the poly-A tail, enabling reverse transcription to generate complementary DNA (cDNA).
  • rRNA highly abundant ribosomal RNA
  • the poly-A enrichment strategy facilitates a more focused analysis of protein-coding RNA transcripts, enhancing the sensitivity and specificity of downstream sequencing analysis.
  • Poly-A RNA in the cfRNA may be captured after the extraction to obtain a poly-A RNA library.
  • the number of A’s in a poly-A sequence can be predetermined based on the desired sensitivity or specificity. For example, in some instances, 4 A’s may be used for a poly-A RNA, while in other instances, 15 A’s may be used.
  • the number of poly-A’s can be at least 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, or 300 or more.
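  • The effect of the predetermined number of A's can be illustrated with a simple sequence check. This is a sketch only: the function names and the default threshold of 15 are assumptions for illustration, and the disclosed capture itself relies on oligo binding during library construction rather than on a computational filter.

```python
def poly_a_tail_length(sequence):
    """Count the consecutive A's at the 3' end of a sequence."""
    seq = sequence.upper()
    return len(seq) - len(seq.rstrip("A"))


def has_poly_a_tail(sequence, min_as=15):
    """Return True if the 3' end carries at least `min_as` consecutive A's."""
    return poly_a_tail_length(sequence) >= min_as


# Hypothetical sequences with short and long 3' A runs.
short_tail = "ACGTACGTAAAA"
long_tail = "ACGTACGT" + "A" * 20
```

  • A low threshold (e.g., 4) admits more molecules at the cost of specificity, whereas a higher threshold (e.g., 15) is more selective.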
  • a location of the poly-A RNA may also be relevant to disease diagnosis. PCR or other probe-based techniques can be used to detect the location of the poly-A RNA.
  • a copy DNA (cDNA) library can be used for downstream analysis.
  • a variety of methods can be used to prepare and generate the cDNA library.
  • full-length cDNA can be synthesized from the cfRNA from the healthy control samples and the patient samples using a commercial kit (e.g., the Takara SMART-Seq HT kit).
  • Nanopore full-length cDNA libraries may also be prepared using a similar method (e.g., the Oxford Nanopore ligation kit LSK109).
  • a size distribution of the cfRNA and the resulting cDNA may be evaluated using an analyzer (e.g., Agilent bioanalyzer).
  • Final libraries may be made using another commercial kit (e.g., the Illumina Nextera XT DNA Prep kit) and further sequenced using a high-throughput sequencer (e.g., Illumina NextSeq 500). If nanopore techniques are used to generate the full-length cDNA libraries, the sequencing may be performed on an Oxford Nanopore MinION device with R9.4 flowcells.
  • the disclosed techniques comprise sequencing and aligning the cfRNA extracted from a biological sample from a subject to produce cfRNA sequence reads.
  • the cfRNA sequence reads can then be aligned and quantified using a repeat-aware transcriptome annotation or a repeat-aware genome annotation.
  • the aligning and quantifying the cfRNA sequence reads may comprise aligning the cfRNA sequence reads to an annotated reference sequence to obtain transcriptome annotations or genome annotations for the cfRNA sequence reads.
  • a plurality of regions of the annotated reference sequence can be annotated as corresponding to repeat subfamilies of a plurality of repeat families. For each repeat subfamily in the plurality of repeat families, a number of cfRNA sequence reads that align to a region annotated as corresponding to the repeat subfamily may be further determined.
  • the sequencing comprises reverse transcribing the cfRNA using oligo dT primers or random hexamer primers to produce complementary DNA (cDNA) and sequencing the cDNA to produce cfRNA sequence reads.
  • In some embodiments, the sequencing is performed on captured poly-A RNA sequences.
  • quality control is performed to the cfRNA sequence reads prior to the alignment.
  • the cfRNA sequence reads may be trimmed using read trimming tools (e.g., Trimmomatic).
  • the alignment may be performed, and a quality score may be assessed using a quality control tool (e.g., FastQC).
  • the alignment result and/or quality scores may be visualized using a general bioinformatics tool such as MultiQC.
  • Aligning and quantifying the cfRNA sequence reads is performed using a repeat-aware transcriptome annotation (or a repeat-aware genome annotation).
  • the repeat-aware transcriptome annotation comprises repeat elements across a human reference genome.
  • multiple transcriptome annotations may be used for the quantification.
  • An exemplary alignment and quantification may be performed using suitable techniques and tools (e.g., Salmon) following a custom designed protocol with two separate transcriptome annotations:
  • Gene annotations, e.g., the GENCODE consortium Hg38 reference annotation (61,488 genes);
  • Repeat annotations, e.g., a concatenation of the above gene annotation and a custom repeat element annotation (e.g., of Hg38 or another human reference genome). There can be up to about 5 million insertions.
  • FIG. 7 is an exemplary diagram illustrating a computational approach for the repeat-aware cfRNA profiling platform using a repeat-aware transcriptome annotation.
  • Repeat annotations 710 can be obtained using UCSC RepeatMasker, and Gene annotations 720 can be obtained using GENCODE.
  • the repeat annotations 710 and gene annotations 720 are combined to constitute a repeat-aware transcriptome annotation 730 (or a repeat-aware genome annotation).
  • the cfRNA sequence reads 740 may be aligned using the repeat-aware transcriptome annotation 730 using a suitable software such as Salmon. Quantifications for the cfRNA sequence reads can subsequently be obtained and used in downstream applications.
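  • One way to picture the combination of gene and repeat annotations is as a concatenation of two sets of reference sequences into a single reference for indexing. The sketch below assumes both annotations are already available as FASTA text; the record names, sequences, and file names are hypothetical, and the Salmon command lines are shown only as argument lists (not executed) and do not reproduce the custom protocol referred to above.

```python
def parse_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) records."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def combine_references(gene_fasta, repeat_fasta):
    """Concatenate gene and repeat records, rejecting duplicate names."""
    combined, seen = [], set()
    for name, seq in parse_fasta(gene_fasta) + parse_fasta(repeat_fasta):
        if name in seen:
            raise ValueError(f"duplicate record name: {name}")
        seen.add(name)
        combined.append((name, seq))
    return combined


def to_fasta(records):
    """Serialize (name, sequence) records back to FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical toy inputs standing in for GENCODE and RepeatMasker sequences.
gene_fasta = ">ENST_demo\nACGTACGT\n"
repeat_fasta = ">AluY_chr1_100\nGGCCGGGCGC\n"
combined = combine_references(gene_fasta, repeat_fasta)

# Assumed follow-up commands (file and directory names are placeholders).
salmon_index_cmd = ["salmon", "index", "-t", "repeat_aware.fa", "-i", "repeat_aware_index"]
salmon_quant_cmd = ["salmon", "quant", "-i", "repeat_aware_index", "-l", "A",
                    "-r", "reads.fastq.gz", "-o", "quant_out"]
```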
  • a subset of cfRNA sequences may be selected and quantified based on criteria.
  • the criteria may be based on a mapping strategy or mappability, a fragment-level GC content bias, a sequence-specific bias, an orphan mapping (i.e., a mapping is found for only one end of a fragment rather than for both ends of the pair), and/or a data-driven likelihood factorization.
  • the criteria can be capable of identifying selective alignment, reducing sequence biases, rescuing reads with unmapped pairs, and/or improving quantification accuracy.
  • a repeat-aware annotation aggregation may be performed (e.g., a transcript-to-gene aggregation) together with or after the alignment.
  • Quantifications for the elements in a subfamily are added together to determine a total quantification for each subfamily. The total quantifications can be used as features for downstream analysis. This aggregation approach substantially reduces the number of features used in downstream applications by quantifying only the subfamilies.
  • the features determined using the repeat-aware annotation aggregation are repeat-aware cfRNA features.
  • the aggregation approach significantly reduces the features (from about 5 million to about 15,000) that are used in downstream analysis without affecting specificity and sensitivity of the downstream analysis.
  • the repeat-aware cfRNA features can be incorporated into a machine learning model for disease classification.
  • the repeat-aware cfRNA features may be input into a feature vector for a model training or disease classification.
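  • Building a feature vector from repeat-aware cfRNA features can be sketched as placing each sample's quantifications into a fixed feature order, so that every sample presents the same columns to a model. The feature names, the zero fill for undetected features, and the function name are illustrative assumptions.

```python
def build_feature_vector(sample_features, feature_order):
    """Arrange one sample's features in a fixed order; missing features become 0.0."""
    return [float(sample_features.get(name, 0.0)) for name in feature_order]


# Hypothetical fixed feature order shared by training and classification.
feature_order = sorted({"AluY", "AluSx", "L1HS"})

# Hypothetical aggregated quantifications for one sample (AluSx not detected).
sample = {"AluY": 15.0, "L1HS": 5.0}
vector = build_feature_vector(sample, feature_order)
```

  • Fixing the order once and reusing it for every sample keeps training and classification inputs aligned.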
  • single molecule sequencing techniques (e.g., nanopore) may be used.
  • the nanopore sequencing techniques can provide long-read data using a similar repeat-aware reference as described above.
  • a sequence read obtained using the single molecule sequencing techniques covers the entire or a part of a repeat-element region and a coding/non-coding region.
  • the sequence read may be used to determine neoantigens as new therapeutics for a personalized cancer treatment.
  • the determined neoantigens may be encoded to produce synthetic mRNA vaccines.
  • the neoantigen sequences may be used to produce recombinant proteins (or expression vectors) that can be used for producing the synthetic mRNA vaccines.
  • the quantification of the cfRNA sequence reads is normalized. The normalization can be performed by accounting for sequencing depth and RNA composition (e.g., DESeq2 count normalization via median of ratios method). For example, a variance stabilizing transformation (VST) may be applied to the quantification of the cfRNA sequence reads, which helps to stabilize the variance and reduce the dependency of dispersion on mean expression levels. Other suitable normalization methods may also be used to ensure accurate and reliable quantification.
  • the normalized quantification can be performed among different testing groups or cohorts.
  • a variety of standards may be used for grouping and the grouping may be performed internally according to the specific need or externally provided by a third party.
  • an internal cohort may be grouped based on age, gender, input volume (e.g., ranging from about 0.1 ml to about 10 ml), and/or a first set of conditions.
  • an external cohort may be grouped based on age, gender, and/or a second set of conditions.
  • the first set of conditions may be overlapping with, same as, or disjunct from the second set of conditions.
  • a differential expression (DE) analysis may be performed in the repeat-aware analysis using the internal and/or external cohort data.
  • an unsupervised analysis (e.g., principal component analysis (PCA)) may be performed.
  • cfRNA length analysis may be performed.
  • Repeat elements demonstrate distinct biological patterns among patients and healthy subjects, as well as different disease subjects. As discussed in Section I, the repeat-aware analysis that takes quantifications of repeat elements provides enhanced performance in distinguishing patient samples from healthy controls.
  • the repeat superfamily, family, subfamily, and element information can be obtained from repeat annotations (e.g., RepeatMasker, Dfam, Repbase) of a human reference genome (e.g., genome browser, genome assembly, T2T genome, pangenome).
  • various implementations can use at least 3, 4, 5, or 6 SINE families (e.g., Alu), 4, 5, 6, 7, or 8 LINE families (e.g., L1), and 2, 3, 4, 5, or 6 LTR families (e.g., ERV1, ERVK, ERVL).
  • Specific cfRNA repeats may be enriched in samples having a certain type of disease. Therefore, it can be meaningful and efficient to aggregate cfRNA features based on repeats and subfamilies.
  • Alu subfamily elements are the most enriched TE signal in pancreatic cancer patient cell-free RNA, with AluY, AluSc, AluSg7, AluSc8, AluSx3, and AluSg subfamily elements among the most significantly enriched when compared to healthy individuals.
  • Aggregating cfRNA features by Alu subfamily can substantially improve the effectiveness and efficiency of diagnosis using the TE repeats in this subfamily.
  • the aggregating is performed by adding the quantifications in each member of a subfamily together to provide an aggregated quantification for the subfamily.
  • Exemplary repeat-aware analysis using the aggregated features include differential expression (DE) analysis, entropy analysis, unsupervised analysis, and cfRNA length analysis.
  • a quantification of each aggregate feature based on a subfamily level or a gene level may be first normalized before performing the repeat-aware analysis. Normalization may be performed using the median of ratios (e.g., DESeq2) method, where counts are divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene.
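  • The median-of-ratios calculation described above can be sketched in a few lines. This is an illustrative re-implementation of the idea, not the DESeq2 code itself; the sample names and counts are hypothetical, and features with a zero count in any sample are excluded from the size-factor estimate because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: dict mapping sample name -> list of counts, one per feature,
    with the same feature order in every sample.
    """
    samples = list(counts)
    n_features = len(counts[samples[0]])
    # Keep only features that are non-zero in every sample.
    usable = [j for j in range(n_features)
              if all(counts[s][j] > 0 for s in samples)]
    if not usable:
        raise ValueError("no feature is non-zero in every sample")
    # Log of the per-feature geometric mean across samples.
    log_geo_mean = {
        j: sum(math.log(counts[s][j]) for s in samples) / len(samples)
        for j in usable
    }
    return {
        s: math.exp(median([math.log(counts[s][j]) - log_geo_mean[j] for j in usable]))
        for s in samples
    }


def normalize(counts):
    """Divide every count by its sample-specific size factor."""
    factors = size_factors(counts)
    return {s: [c / factors[s] for c in counts[s]] for s in counts}


# Hypothetical counts: s2 was sequenced about twice as deeply as s1.
raw = {"s1": [10, 20, 30, 0], "s2": [20, 40, 60, 5]}
factors = size_factors(raw)
normalized = normalize(raw)
```

  • After normalization the first three features agree between the two samples, which is the intended effect of removing the depth difference.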
  • pair-wise comparisons may be determined between testing cohorts, e.g., comparing pancreatic cancer patients with healthy individuals.
  • the cohort may comprise subjects grouped based on certain conditions. Exemplary conditions include age, gender, and input volume.
  • entropy is used to measure differences in variability of the aggregated quantifications.
  • an entropy is calculated on a per-sample basis for each biotype/subfamily. The entropy may be calculated using:
  • $H_{\mathrm{biotype}} = -\sum_{i=1}^{n} p_i \log_2(p_i)$ (Equation 1)
  • where $p_i$ represents the fractional contribution of a given feature $i$ (of $n$ total) belonging to the biotype of interest to the total biotype abundance.
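  • A per-sample entropy for one biotype or subfamily can be computed as a Shannon entropy (base 2) over the fractional contributions of its features. The sketch below assumes that form; the function name and abundance values are hypothetical, and features with zero abundance are skipped because their contribution to the sum is taken as zero.

```python
import math


def biotype_entropy(abundances):
    """Shannon entropy (bits) of the fractional contributions within one biotype."""
    total = sum(abundances)
    if total <= 0:
        return 0.0  # no signal for this biotype in this sample
    entropy = 0.0
    for abundance in abundances:
        if abundance > 0:
            p = abundance / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical abundances: evenly spread versus dominated by one feature.
even = biotype_entropy([1.0, 1.0, 1.0, 1.0])
dominated = biotype_entropy([5.0, 0.0, 0.0])
```

  • Higher entropy indicates that abundance is spread across many features of the biotype, while an entropy near zero indicates that a single feature dominates.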
  • the DE results and/or the entropy may be used in disease classification and performance evaluation.
  • the normalized quantifications can be used to derive statistically meaningful information.
  • a predetermined number (e.g., 50) of principal components (PCs) may be calculated using principal component analysis (PCA).
  • the PCA may be performed using a count matrix filtered to include only genes with non-zero standard deviation across samples.
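  • The filtering step can be sketched as below. Features that are constant across samples carry no variance for PCA and cannot be scaled to unit variance, which is one likely reason for removing them. The function name and the toy matrix are hypothetical.

```python
from statistics import pstdev

def filter_nonzero_sd(count_matrix):
    """Keep only features whose counts vary across samples.

    count_matrix: dict mapping feature -> list of normalized counts, one per sample.
    """
    return {g: row for g, row in count_matrix.items() if pstdev(row) > 0}

matrix = {"geneA": [3.0, 3.0, 3.0], "AluY": [1.0, 8.0, 20.0], "geneB": [0.0, 0.0, 0.0]}
filtered = filter_nonzero_sd(matrix)
```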
  • correlations among the PCs may be further calculated and a clustering heatmap may be generated to visualize the correlations.
  • the correlations and/or the heatmap may further be used in downstream applications such as disease diagnosis.
  • lncRNA, protein-coding, and transposable element reference sequences are retrieved using techniques disclosed herein. Sequences may be extracted from a human reference genome (e.g., hg38) to create the biotype reference genomes used in the length analyses. Nanopore reads obtained using a nanopore sequencing technique may be aligned to the human reference genome using a mapping tool (e.g., Minimap2). To determine alignments in genomic regions with overlapping annotations, the length of the aligned fragment is compared to the lengths of the overlapping repeats. The annotation with the closest length to that of the fragment may be chosen as the correct alignment.
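  • The closest-length rule for overlapping annotations can be sketched as follows. The tuple layout, the function name, and the coordinates are assumptions made for illustration.

```python
def choose_annotation(fragment_length, overlapping_annotations):
    """Pick the annotation whose length is closest to the aligned fragment length.

    overlapping_annotations: list of (name, start, end) tuples, with end exclusive.
    Returns the chosen annotation name, or None if there are no candidates.
    """
    if not overlapping_annotations:
        return None
    best = min(
        overlapping_annotations,
        key=lambda ann: abs((ann[2] - ann[1]) - fragment_length),
    )
    return best[0]

# Hypothetical overlap: a 300 bp Alu element nested in a 6 kb LINE-1 element.
candidates = [("AluY", 1000, 1300), ("L1PA2", 900, 6900)]
best = choose_annotation(290, candidates)
```

  • A 290-nucleotide fragment is assigned to the 300 bp element rather than the 6 kb element it sits inside.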
  • a second set of samples may be obtained from the subject or a second set of cfRNAs may be extracted from the biological sample to perform PCR-based analysis and/or detection.
  • the PCR used may be custom designed with primers to detect the specific repeat-aware features.
  • a second round of sequencing, alignment, and quantification may be performed using products of the PCR-based assays or tests.
  • the repeat-aware sequencing and analysis platform also comprises providing interpretation of the repeat-derived cfRNA data and/or the repeat subfamilies.
  • the disclosed repeat-aware analysis contextualizes the differences in the respective repeat superfamilies.
  • the platform may provide that SINE-derived cell-free RNAs are uniform in their enrichment in the pancreatic cancer patient cohort, while simple repeat-derived cell-free RNAs are far more divergent, with some simple repeat-derived cell-free RNAs being enriched in healthy individuals.
  • FIG. 8 is an exemplary flowchart illustrating a method 800 for repeat-aware analysis according to various embodiments. It can be performed by any techniques described herein. For example, aspects of method 800 can be performed in a similar manner as those performed in the pipeline 100 of FIG. 1, or using a repeat-aware cfRNA profiling platform, for example, the platform illustrated in FIG. 7.
  • cfRNA from a biological sample is sequenced to obtain cfRNA sequence reads.
  • the cfRNA may be sequenced using a single molecule sequencer, for example, a nanopore sequencer.
  • the sequencing may be performed by reverse transcribing the cfRNA using oligo dT primers or random hexamer primers to produce complementary DNA (cDNA) and then sequencing the cDNA to produce cfRNA sequence reads.
  • the method 800 further comprises isolating extracellular vesicles from the biological sample from the subject and isolating the cfRNA from the extracellular vesicles using suitable techniques including those disclosed in Sections II(A) and II(B).
  • the cfRNA sequence reads are aligned using a repeat-aware transcriptome annotation or a repeat-aware genome annotation to obtain repeat-aware cfRNA sequence reads.
  • the aligning the cfRNA sequence reads is performed by aligning the cfRNA sequence reads to an annotated reference sequence, and a first plurality of regions of the annotated reference are annotated as corresponding to genes and a second plurality of regions are annotated as corresponding to repeat elements of a plurality of repeat subfamilies.
  • the repeat-aware cfRNA sequence reads are quantified, and the quantifications may be used in downstream analysis.
  • the cfRNA sequence reads are quantified for each gene or repeat element by quantifying cfRNA sequence reads that align to a region annotated as corresponding to the gene or the repeat element.
  • repeat-aware cfRNA features are generated at a repeat subfamily level by aggregating repeat-aware cfRNA sequence reads from a same repeat subfamily to obtain the repeat-aware cfRNA features at the repeat subfamily level.
  • the repeat-aware cfRNA features may be incorporated into a machine learning model for disease classification.
  • the machine learning model may be pretrained using various techniques including those disclosed in Section V(A).
  • a disease is detected in the subject when one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features are detected.
  • a disease may also be detected when the quantification of the repeat-aware cfRNA sequence reads meets predetermined criteria.
  • the disease may be a cancer, a cardiovascular disease, or a neurodegenerative disease. Repeat elements are known to be dysregulated and aberrantly expressed in various neurodegenerative diseases. Copley KE, Shorter J. Repetitive elements in aging and neurodegeneration. Trends Genet. 2023 May;39(5):381-400. doi:
  • the cardiovascular disease may be the result of a viral infection.
  • the disease may be detected in a subject when the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features of the subject are at a level that is greater than a level of corresponding repeat-aware cfRNA sequence reads or repeat-aware cfRNA features in subjects that do not have the disease (“a normal level”). For example, when the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features of the subject are at a level that is at least about 2 times greater than the normal level, the disease is determined to be detected in the subject.
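  • A minimal sketch of the fold-over-normal check. The text does not specify how the normal level is summarized, so the mean of the healthy cohort is used here as an assumption; the function name and the example values are hypothetical.

```python
from statistics import mean

def exceeds_normal_level(subject_level, healthy_levels, fold=2.0):
    """Return True if the subject's level is at least `fold` times the normal level.

    The normal level is taken as the mean across healthy subjects (an assumption).
    """
    normal_level = mean(healthy_levels)
    if normal_level == 0:
        # Any detectable signal exceeds a normal level of zero.
        return subject_level > 0
    return subject_level / normal_level >= fold

healthy_aluy = [10.0, 20.0, 15.0]               # hypothetical normalized AluY counts
flagged = exceeds_normal_level(45.0, healthy_aluy)
not_flagged = exceeds_normal_level(20.0, healthy_aluy)
```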
  • a disease is detected in the subject by performing a differential expression analysis between a test subject and subjects that do not have the disease using the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
  • a disease is detected in the subject by incorporating the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features into a machine learning model.
  • the machine learning model may be a binary classifier or a multi-class classifier for disease classification.
  • a tissue of origin of the disease in the subject is identified by detecting one or more tissue-specific repeat-aware cfRNA sequence reads or repeat-aware cfRNA features for deconvolution of the cfRNA transcriptome.
  • the deconvolution of the cfRNA transcriptome can identify the relative contributions of different cell types to the cfRNA transcriptome by identifying distinct gene expression patterns and/or repeat element patterns of the cfRNA sequencing data.
  • a reference dataset may be constructed, which includes gene expression profiles and/or repeat element profiles. Using the reference dataset, computational algorithms (e.g., CIBERSORT or xCell) are applied to deconvolute the cfRNA transcriptome.
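  • The idea of reference-based deconvolution can be illustrated with a simple non-negative least-squares fit. This sketch uses multiplicative updates as a stand-in; it does not reproduce the algorithms of CIBERSORT or xCell, and the signature values, function name, and cell-type labels are hypothetical.

```python
def deconvolve(signatures, mixture, iterations=500):
    """Estimate non-negative cell-type fractions f so that signatures * f approximates mixture.

    signatures: list of rows, one per feature, each holding one reference value per cell type.
    mixture: observed cfRNA abundance per feature, in the same feature order.
    Assumes non-negative inputs. Returns fractions that sum to 1.
    """
    n_types = len(signatures[0])
    # Precompute A^T A and A^T b for the least-squares problem.
    ata = [[sum(row[i] * row[j] for row in signatures) for j in range(n_types)]
           for i in range(n_types)]
    atb = [sum(row[i] * m for row, m in zip(signatures, mixture)) for i in range(n_types)]
    f = [1.0 / n_types] * n_types
    for _ in range(iterations):
        denom = [sum(ata[i][j] * f[j] for j in range(n_types)) for i in range(n_types)]
        # The multiplicative update keeps every estimate non-negative.
        f = [f[i] * atb[i] / denom[i] if denom[i] > 0 else 0.0 for i in range(n_types)]
    total = sum(f)
    return [x / total for x in f] if total > 0 else f

# Hypothetical reference profiles for two cell types across three features.
reference = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
observed = [7.3, 3.7, 5.0]  # constructed as 0.7 * type 1 + 0.3 * type 2
fractions = deconvolve(reference, observed)
```

  • Because the toy mixture was built from known proportions, the fit recovers contributions of about 70% and 30%.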
  • a target treatment is selected for the detected disease based on the one or more repeat-aware cfRNA sequence reads or repeat-aware cfRNA features.
  • the targeted treatment may be administered to the subject and an efficacy of the targeted treatment may be monitored.
  • the monitoring may be performed by obtaining a biological sample of the subject and performing the sequencing, aligning, and quantifying steps at blocks 810, 820, and 830 at a predetermined frequency, for example, every six months, or once per year.
  • the disclosed techniques of using the repeat-aware sequencing and analysis pipeline provide disease-specific biomarkers and corresponding patterns that are informative in downstream diagnostic applications.
  • FIGS. 9 and 10A illustrate that in healthy individuals, less than 10% of the DESeq2 normalized cell-free RNA counts consistently corresponded to repeats. However, significantly larger fractions of repeat-derived cell-free RNAs are demonstrated across almost all pancreatic cancer patients. It is noteworthy that most of these cell-free RNAs in the pancreatic cancer patients are derived from SINEs. These examples demonstrate that the DESeq2 normalized cell-free RNA counts can be used as a biomarker to distinguish pancreatic cancer patients from healthy subjects and for further downstream analysis.
  • FIG. 9 is a bar chart that illustrates a distribution of biotype representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for samples from each cohort (normal samples and pancreatic cancer patient by stage), shaded by GENCODE biotype or repeat subfamily, and facetted by stage (normal samples NA). It can be shown from the figure that cancer patient plasma cell-free RNA has a much higher fraction of repeat signal compared to healthy subjects.
  • Significant differences in the information content of protein-coding RNAs, lncRNAs, long terminal repeats (LTR), SINE, and simple repeat superfamilies are demonstrated in FIG. 10A.
  • a machine learning model is trained or pretrained using results from the repeat-aware analysis to provide disease classification, prognosis, and/or treatment.
  • a number of feature sets belonging to the following three categories may be selected and used as input matrices for the model training:
  • Entropy TE Clade entropy and TE Clade entropy plus Repeat-naive features.
  • a feature set is selected for the model training.
  • a feature set comprising differentially expressed genes is selected for a machine learning model.
  • the machine learning model may be trained using different disease cohorts and one or more corresponding healthy cohorts. Each disease cohort is paired with a healthy cohort.
  • the training data may be split into stratified training (e.g., 80%) and testing (e.g., 20%) subsets.
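  • A stratified 80/20 split can be sketched in plain Python as follows. The function name, the sample identifiers, and the fixed random seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(sample_ids, labels, test_fraction=0.2, seed=0):
    """Split samples into training and testing subsets, preserving label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample_id, label in zip(sample_ids, labels):
        by_label[label].append(sample_id)
    train, test = [], []
    for ids in by_label.values():
        shuffled = ids[:]
        rng.shuffle(shuffled)
        # Hold out the same fraction of every class.
        n_test = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test

ids = [f"s{i}" for i in range(20)]
labels = ["cancer"] * 10 + ["healthy"] * 10
train_ids, test_ids = stratified_split(ids, labels)
```

  • Each cohort contributes the same fraction of its samples to the testing subset, so the class balance of the full data set is preserved in both subsets.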
  • the machine learning model is a logistic regression model.
  • the logistic regression model may be optimized by performing cross-validated classification with higher number of folds (e.g., using 10-fold cross-validation instead of 5-fold) using elastic net penalty values. An optimized feature selection may be further performed via regularizing model parameters.
  • a final model trained on the entire training split with selected features may be provided. Top performing models (e.g., using a pre-determined loss function) are identified by training sensitivity at 90% specificity.
  • the machine learning model is a binary or multi-class classifier. In some embodiments, the machine learning model is an artificial neural network model.
  • differential expression may be calculated using only the training split and excluding testing samples. Model performance is evaluated by generating prediction probabilities on the test split samples and classifying based on the 90% specificity probability threshold defined in training. Features with non-zero coefficients in the final models are identified to determine a total feature size. Confidence intervals for sensitivity are estimated as binomial confidence intervals based on the successful observations and the total training/testing cohort.
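  • The evaluation at a fixed specificity can be sketched as below. The text does not name the binomial interval method, so the Wilson score interval is used here as an assumption; the function names and the example probabilities are hypothetical.

```python
import math

def threshold_at_specificity(healthy_scores, specificity=0.9):
    """Smallest cutoff such that at least `specificity` of healthy scores are at or below it."""
    ordered = sorted(healthy_scores)
    # Small epsilon guards against floating-point overshoot before rounding up.
    k = math.ceil(specificity * len(ordered) - 1e-9)
    return ordered[k - 1]

def sensitivity(disease_scores, threshold):
    """Fraction of disease samples scored strictly above the threshold."""
    return sum(s > threshold for s in disease_scores) / len(disease_scores)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (approximately 95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical predicted probabilities from a trained classifier.
healthy = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.40, 0.45, 0.80]
disease = [0.30, 0.50, 0.60, 0.70, 0.90]
cutoff = threshold_at_specificity(healthy)   # defined on training data
sens = sensitivity(disease, cutoff)
low, high = wilson_interval(4, 5)            # 4 of 5 disease samples detected
```

  • The cutoff is fixed on training data and then applied unchanged to the test split, which matches the evaluation order described above.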
  • a subset of the repeat subfamilies of the plurality of repeat families is selected as features for the machine learning model and the selected features may be output for further analysis.
  • FIG. 10B is a volcano plot of differentially expressed genes/repeat subfamilies derived from repeat-aware quantification with horizontal and vertical lines drawn at -log10(0.01) and 0, respectively.
  • Alu subfamily elements were the most enriched TE signal in pancreatic cancer patient cell-free RNA, with AluY, AluSc, AluSg7, AluSc8, AluSx3, and AluSg subfamily elements among the most significantly enriched (p < 0.01) in pancreatic cancer patients when compared to healthy individuals.
  • Diseases may be detected using the trained machine learning model.
  • the detecting the disease may comprise detecting one or more cfRNA sequence reads quantified using the repeat-aware transcriptome annotation or the repeat-aware genome annotation, and determining if the one or more cfRNA sequence reads are at a level that is greater than a level of corresponding cfRNA sequence reads in subjects that do not have the disease, or at a level that is at least about 2 times greater than a level of corresponding cfRNA sequence reads in subjects that do not have the disease.
  • the repeat-aware analysis pipeline also comprises clustering subjects based on the informative biomarkers (e.g., the repeat-aware features).
  • FIG. 12 is a heatmap illustrating using normalized counts of SINEs and simple repeats (i.e., DNA stretches consisting of short, tandemly repeated di-, tri-, tetra-, or penta-nucleotide motifs) to cluster pancreatic cancer samples/subjects.
  • Prognostic knowledge may also be learned from the analysis results. For example, as shown in the heatmap of FIG. 13D, a lower expression level of a biomarker may indicate a higher survival rate (e.g., lower levels of MYC target gene expression suggest higher survival rates). The expression level may also be associated with survival times.
  • a different trained machine learning model e.g., a convolutional neural network
  • FIGS. 14A-14H are statistical graphs illustrating prognostic potential of differentially expressed cell-free RNA.
  • FIG. 14A shows Kaplan-Meier (KM) overall survival (O.S.) curve for TCGA pancreatic cancer samples stratified into thirds based on expression of GENCODE-annotated genes significantly enriched in pancreatic cancer patient cell-free RNA.
  • FIG. 14B shows cell-free RNA abundance in normalized (norm.) counts for top 5 (padj) differentially expressed (panc v control) GENCODE up genes.
  • FIG. 14C shows KM O.S. curve for TCGA pancreatic cancer samples stratified into thirds based on expression of repeat elements significantly depleted in pancreatic cancer patient cell-free RNA.
  • FIG. 14D shows cell-free RNA abundance in normalized (norm.) counts for top 5 (padj) differentially expressed (panc v control) Repeat down genes.
  • FIG. 14E shows KM O.S. curve for TCGA pancreatic cancer samples stratified into thirds based on expression of oxidative phosphorylation hallmark gene set genes significantly enriched in pancreatic cancer patient cell-free RNA.
  • FIG. 14F shows cell-free RNA abundance in normalized (norm.) counts for top 5 (padj) differentially expressed (panc v control) Oxidative Phosphorylation genes.
  • FIG. 14G shows KM O.S.
  • FIG. 14H shows cell-free RNA abundance in normalized (norm.) counts for top 5 (padj) differentially expressed (panc v control) MYC Targets genes.
  • TCGA RNA-seq data are recomputed using the repeat-aware analysis to determine the prognostic potential of TE-derived cell-free RNAs, as determined by Kaplan-Meier analysis.
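  • For readers unfamiliar with Kaplan-Meier analysis, the estimator behind the survival curves can be sketched as follows. This is the textbook product-limit estimator, not the specific software used to produce the figures; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per subject.
    events: True if death was observed at that time, False if the subject was censored.
    Returns a list of (time, survival_probability) at each time with an observed death.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += int(data[i][1])
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set.
        at_risk -= removed
    return curve

# Hypothetical follow-up (months); the second subject is censored.
curve = kaplan_meier([1, 2, 3, 4], [True, False, True, True])
```

  • In a stratified analysis such as the one described, one curve would be computed per expression tertile and the curves compared.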
  • a tissue of origin of the disease in the test subject may be also identified using the analysis and/or the classification results.
  • the identifying the tissue of origin of the disease may comprise detecting one or more tissue-specific, repeat-aware cfRNA sequence reads for deconvolution of the cfRNA transcriptome.
  • a treatment for the detected disease may be determined or developed based on the one or more cfRNA sequence reads quantified using a repeat-aware transcriptome annotation or the repeat-aware genome annotation.
  • the treatment may be monitored using cfRNA sequencing data. Additional samples may be collected and obtained during the treatment on a prescribed frequency, and quantifications of the cfRNA data are used to monitor the treatment and evaluate the efficacy of the treatment.
  • the pipeline may further comprise administering a treatment to the disease based on biomarker information learned from the analysis and classification. For example, when knowing that AluY, AluSc, AluSg7, AluSc8, AluSx3, and AluSg subfamily elements are significantly enriched in pancreatic cancer patients, further analysis, diagnosis, and prognosis can be made based on regulating the Alu subfamily elements.
  • FIG. 1 shows that AluY, AluSc, AluSg7, AluSc8, AluSx3, and AluSg subfamily elements are significantly enriched in pancreatic cancer patients.
  • FIG. 15C is an MA plot of log2FoldChange between pancreatic cancer and normal samples compared to log-scale baseMean derived from DESeq2. It is shown that the upregulated Alu elements in pancreatic cancer cell-free RNA are highly abundant, despite the lack of increase in overall SINE complexity in the pancreatic cancer cell-free RNA transcriptome (see, e.g., FIG. 10A).
  • the disclosed techniques are general and applicable to various biomarker identification and disease diagnosis.
  • the repeat-aware sequencing and analysis platform is used to provide quantification to analyze lung, liver, esophageal, colorectal, and stomach cancer cell-free RNA-seq data, along with their corresponding healthy controls. Repeat superfamily variability is observed in both healthy and cancer cell-free RNA transcriptomes (see, e.g., FIGS.
  • mapping rate by using the repeat-aware sequencing annotation for analyzing esophageal, liver, and stomach cancer cell-free RNA transcriptomes (see, e.g., FIG. 16B).
  • FIG. 17A illustrates a distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for pancreatic cancer, shaded by repeat subfamily, and facetted by stage according to certain embodiments.
  • FIG. 17B illustrates a distribution of repeat representation (by DESeq2 normalized count) in cell-free RNA-seq quantifications for each cancer type, shaded by repeat subfamily according to certain embodiments.
  • FIGS. 18A-18T are statistical plots that illustrate that using repeat-aware sequencing features can improve performance of diagnostic models. Optimized repeat-aware models are compared to their repeat-naive counterparts, revealing repeat-driven increases in both area under the curve (AUC) (see, e.g., FIGS. 18A, 18E, 18I, 18M, and 18Q) and training sensitivity (liver cancer: 86% sensitivity (FIGS. 18A and 18B); esophageal cancer: 56% sensitivity (FIGS. 18F and 18G); colorectal cancer: 91% sensitivity (FIGS. 18J and 18K); stomach cancer: 86% sensitivity (FIGS. 18N and 18O); and lung cancer: 93% sensitivity (FIGS. 18R and 18S)) at 90% specificity.
  • FIG. 18A shows a large repeat fraction in liver cancer subjects, indicating that liver cancer classification depends correspondingly on repeat-aware features; using the large repeat fraction as a feature in training results in a significant (p < 0.05) improvement in training sensitivity.
  • FIGS. 18C, 18G, 18K, 18O, and 18S illustrate that the classification performance in the respective testing cohorts largely reflects the improvements seen in training, suggesting that the disclosed techniques have the potential to generalize to unseen data. Notably, cancer-specific differences in repeat-aware feature dependence for disease classification are also observed.
  • Stomach (FIG. 18P) and colorectal (FIG. 18L) cancer models each utilize 1 repeat-aware feature, liver (FIG. 18D) and esophageal (FIG. 18H) cancer models utilize 5 and 10 repeat-aware features, respectively, and the lung (FIG. 18T) cancer model utilizes a variety of repeat-aware features, as well as the most overall features (2118 total features).
  • repeat-aware features enhance disease classification, highlighting the potential of the repeat-aware sequencing and analysis platform for highly sensitive and specific disease diagnosis.
  • Pair-wise comparisons performed between 5 different cancers and healthy individuals also captured robust and significant (p < 0.01) differential expression of repeat-derived cell-free RNAs that is characteristic of each cancer type (see, e.g., FIGS. 13A-13E). Additional analyses of the significantly differentially expressed repeat subfamilies also show that these repeat-derived cell-free RNAs are highly abundant (see, e.g., FIGS. 19A-19F), with significant changes to biotype/repeat superfamily entropy (p < 0.05) (FIG. 19F). Significantly DE genes/repeat subfamilies are shown at full opacity and shade.
  • RNA transcriptomic signatures with diagnostic or prognostic potential
  • the differentially expressed genes in pancreatic cancer or COVID-19 patient cell-free RNA relative to healthy individuals are examined using the repeat-aware sequencing and analysis pipeline, as shown in FIGS. 21A-21B and 22A-22E.
  • 111 well-annotated GENCODE genes were significantly enriched or depleted in pancreatic cancer patients (FIG. 21A), while 900 GENCODE genes were enriched or depleted in COVID-19 patients (FIG. 21B).
  • the examples and figures reveal the value and utility of broadly characterizing the cell-free RNA transcriptome using the disclosed techniques. It has been found using the disclosed techniques that the vast noncoding and repeat-derived cell-free RNA transcriptome is a rich source of novel, abundant, and disease-specific RNA biomarkers, that repeat-derived cell-free RNAs, including simple repeat RNAs and TE RNAs transcribed from LINE, SINE, and LTR elements, are cancer-specific RNAs that are normally present at low or undetectable levels in healthy individuals, and that TE-derived cell-free RNAs are highly enriched in cancer patient plasma, with each cancer type exhibiting its own characteristic TE-derived cell-free RNA signature.
  • the disclosed techniques greatly reduce the number of features used for downstream analysis and disease classification from over 5 million repeat element insertions to about 15,000 aggregated repeat elements at the subfamily level.
  • the disclosed techniques achieve highly accurate disease classification by incorporating protein-coding RNAs, noncoding RNAs such as lncRNAs, and repeat-derived RNAs.
  • the disclosed techniques use long-read RNA-seq platforms such as single-molecule nanopore sequencing and provide additional information regarding the true length of cell-free RNAs.
  • the differences in cell-free RNA length may serve as additional disease-specific features to further improve disease classification.
  • the disclosed techniques also robustly characterize repeat-derived cell-free RNAs in both polyA-selected and total RNA library preparation protocols. In both cell-free RNA-seq contexts, the repeat-aware sequencing and analysis increase mapping rate significantly and provide a richer feature space that leverages highly abundant and disease-specific repeat-derived cell-free RNAs to improve classification performance.
  • the disclosed techniques can also be used to provide systemic insights into disease pathogenesis, as well as to discover novel drug targets for diseases such as cancer. Additionally, the disclosed repeat-aware sequencing and analysis platform enables non-invasive, systemic monitoring of protein-coding and repeat-derived cell-free RNA responses to targeted therapies such as KRAS inhibitors, which induce treated cancer cells to preferentially release TE-derived cell-free RNAs in extracellular vesicles. Given the preferential upregulation and secretion of TE-derived cell-free RNAs in response to mutant KRAS, companion diagnostic tests may be developed using the disclosed techniques that enable the robust detection of TE-derived cell-free RNA signatures of oncogenic RAS signaling.
  • the disclosed techniques may further be implemented in clinical testing and diagnosis. Additional larger and more diverse cell-free RNA-seq datasets may be generated across different cancer types and stages to further improve clinical diagnostic performance. The disclosed techniques are also informative for early cancer detection by leveraging highly abundant TE- and other repeat-derived cell-free RNA features that are disease-specific.
D. Flowchart
  • FIG. 23 is an exemplary flowchart illustrating a method 2300 according to various embodiments. It can be performed by any techniques described herein. For example, aspects of method 2300 can be performed in a similar manner as those performed in the pipeline 100 of FIG. 1, or using a repeat-aware cfRNA profiling platform, for example, the platform illustrated in FIG. 7.
  • a biological sample (e.g., a body fluid sample or a stool sample) is obtained from each training subject.
  • the training subjects include both patient samples and normal (or healthy) samples.
  • a first subset of the training subjects is labeled as having a particular type of cancer, and a second subset of the plurality of training subjects is labeled as not having a particular type of cancer.
  • cfRNA is extracted from each biological sample.
  • the extracting comprises capturing polyadenylated cell-free RNA (poly-A cfRNA) in the cfRNA to obtain a poly-A cfRNA library. It may be required based on sensitivity or clinical/commercial use that the poly-A cfRNA has at least a predetermined number of A’s at a 3’ end. In some embodiments, the predetermined number is 4.
  • Various techniques may be used to capture the poly-A cfRNA, including those disclosed in Section II(C).
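  • The requirement of a minimum number of A’s at the 3’ end can be expressed as a simple check on a read sequence. This sketch assumes the read is written 5’ to 3’ in the sense orientation; the function name and the example sequences are hypothetical.

```python
def has_poly_a_tail(sequence, min_a=4):
    """Return True if the sequence ends (3' end) with at least `min_a` consecutive A bases."""
    seq = sequence.upper()
    # Length of the trailing run of A's.
    tail_length = len(seq) - len(seq.rstrip("A"))
    return tail_length >= min_a

kept = has_poly_a_tail("GCUAGGCAAAA")      # four trailing A's
dropped = has_poly_a_tail("GCUAGGCAAA")    # only three trailing A's
```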
  • the poly-A cfRNA is sequenced to obtain sequence reads.
  • the poly-A cfRNA may be sequenced using a single molecule sequencer, e.g., a nanopore sequencer.
  • Other sequencing techniques can be used, e.g., ones that only sequence a portion of a nucleic acid molecule, such as only one or both ends.
  • the sequence reads corresponding to each subject are aligned to an annotated reference to obtain repeat-aware sequence reads, wherein a first plurality of regions of the annotated reference are annotated as corresponding to genes and a second plurality of regions are annotated as corresponding to repeat elements of a plurality of repeat subfamilies.
  • the repeat-aware sequence reads are quantified for each gene or repeat element by quantifying repeat-aware sequence reads that align to a region annotated as corresponding to the gene or the repeat subfamily.
  • a feature vector that includes a first number of repeat-aware sequence reads that align to genes and a second number of repeat-aware sequence reads that align to each repeat subfamily of the plurality of subfamilies is generated.
  • the feature vector may be also associated with the labels corresponding to the subjects.
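  • The feature vector described above can be sketched as one count of gene-aligned reads followed by one count per repeat subfamily. The input layout, the function name, and the example reads are assumptions made for illustration.

```python
from collections import Counter

def build_feature_vector(read_annotations, subfamilies):
    """Build [gene_read_count, count_for_subfamily_1, count_for_subfamily_2, ...].

    read_annotations: one (kind, name) tuple per aligned read, kind is "gene" or "repeat".
    subfamilies: ordered list of repeat subfamily names defining the vector layout.
    """
    gene_reads = sum(1 for kind, _ in read_annotations if kind == "gene")
    repeat_counts = Counter(name for kind, name in read_annotations if kind == "repeat")
    # Subfamilies with no aligned reads get a zero so every vector has the same length.
    return [gene_reads] + [repeat_counts.get(sf, 0) for sf in subfamilies]

reads = [("gene", "GAPDH"), ("repeat", "AluY"), ("repeat", "AluY"),
         ("repeat", "AluSc"), ("gene", "ACTB")]
vector = build_feature_vector(reads, ["AluY", "AluSc", "L1HS"])
label = 1  # hypothetical label: 1 for a cancer subject, 0 for a healthy subject
```

  • Fixing the subfamily order up front keeps the vectors of all training subjects aligned column by column, which the model training step relies on.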
  • Blocks 2310-2360 may be performed iteratively for each training subject, and feature vectors may be generated after a full run of the iterations.
  • a machine learning model is trained using the feature vectors and the labels for the training subjects. Based on the training, a subset of the repeat subfamilies of the plurality of repeat subfamilies may be selected as features for the machine learning model and the selected features can be output for downstream applications.
  • method 2300 further comprises a testing/ validating process.
  • Poly-A cfRNA in a test biological sample of a test subject may be analyzed to obtain test sequence reads. Based on annotations of the test sequence reads in the annotated reference, a test feature vector that includes a first number of test sequence reads that align to genes and a second number of test sequence reads that align to each repeat subfamily of the plurality of subfamilies may be generated.
  • a disease classification is determined by operating on the feature vector using the trained machine learning model.
  • the analyzing the poly-A RNA in the test biological sample may use PCR to quantify the test sequence reads.
  • the disease may be a cancer, a cardiovascular disease, or a neurodegenerative disease.
  • cDNA from the poly-A cfRNA is synthesized to obtain a cDNA library and the cDNA in the cDNA library is sequenced to obtain the sequence reads.
  • a poly-A enriched cfRNA cDNA library may also be generated from the biological sample by reverse transcribing the poly-A cfRNA, and the poly-A enriched cfRNA cDNA library is sequenced to obtain the sequence reads.
  • Various techniques may be used to synthesize the cDNA or generate the cDNA library, including those disclosed in Section II(D).
  • FIG. 24 illustrates a measurement system 2400 according to an embodiment of the present disclosure.
  • the system as shown includes a sample 2405, such as cell-free nucleic acid molecules (e.g., DNA and/or RNA) within an assay device 2410, where an assay 2408 can be performed on sample 2405.
  • sample 2405 can be contacted with reagents of assay 2408 to provide a signal of a physical characteristic 2415 (e.g., sequence information of a cell-free nucleic acid molecule).
  • An example of an assay device can be a flow cell that includes probes and/or primers of an assay or a tube through which a droplet moves (with the droplet including the assay).
  • Physical characteristic 2415 (e.g., a fluorescence intensity, a voltage, or a current) from the sample is detected by detector 2420.
  • Detector 2420 can take a measurement at intervals (e.g., periodic intervals) to obtain data points that make up a data signal.
  • an analog-to-digital converter converts an analog signal from the detector into digital form at a plurality of times.
  • Assay device 2410 and detector 2420 can form an assay system, e.g., a sequencing system that performs sequencing according to embodiments described herein.
  • a data signal 2425 is sent from detector 2420 to logic system 2430.
  • data signal 2425 can be used to determine sequences and/or locations in a reference genome of nucleic acid molecules (e.g., DNA and/or RNA).
  • Data signal 2425 can include various measurements made at a same time, e.g., different colors of fluorescent dyes or different electrical signals for different molecules of sample 2405, and thus data signal 2425 can correspond to multiple signals.
  • Data signal 2425 may be stored in a local memory 2435, an external memory 2440, or a storage device 2445.
  • the assay system can be comprised of multiple assay devices and detectors.
  • Logic system 2430 may be, or may include, a computer system, ASIC, microprocessor, graphics processing unit (GPU), etc. It may also include or be coupled with a display (e.g., monitor, LED display, etc.) and a user input device (e.g., mouse, keyboard, buttons, etc.). Logic system 2430 and the other components may be part of a stand-alone or network connected computer system, or they may be directly attached to or incorporated in a device (e.g., a sequencing device) that includes detector 2420 and/or assay device 2410. Logic system 2430 may also include software that executes in a processor 2450.
  • Logic system 2430 may include a computer readable medium storing instructions for controlling measurement system 2400 to perform any of the methods described herein.
  • logic system 2430 can provide commands to a system that includes assay device 2410 such that sequencing or other physical operations are performed. Such physical operations can be performed in a particular order, e.g., with reagents being added and removed in a particular order. Such physical operations may be performed by a robotics system, e.g., including a robotic arm, as may be used to obtain a sample and perform an assay.
  • Measurement system 2400 may also include a treatment device 2460, which can provide a treatment to the subject.
  • Treatment device 2460 can determine a treatment and/or be used to perform a treatment. Examples of such treatment can include surgery, radiation therapy, chemotherapy, immunotherapy, targeted therapy, hormone therapy, and stem cell transplant.
  • Logic system 2430 may be connected to treatment device 2460, e.g., to provide results of a method described herein.
  • The treatment device may receive inputs from other devices, such as an imaging device and user inputs (e.g., to control the treatment, such as controls over a robotic system).
  • A computer system includes a single computer apparatus, where the subsystems can be the components of the computer apparatus.
  • A computer system can include multiple computer apparatuses, each being a subsystem, with internal components.
  • A computer system can include desktop and laptop computers, tablets, mobile phones, and other mobile devices.
  • The subsystems shown in FIG. 25 are interconnected via a system bus 75. Additional subsystems such as a printer 74, keyboard 78, storage device(s) 79, monitor 76 (which is coupled to display adapter 82), and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 71, can be connected to the computer system by any number of means known in the art, such as I/O port 77 (e.g., USB, FireWire®).
  • I/O port 77 or external interface 81 can be used to connect computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner.
  • The interconnection via system bus 75 allows the central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 72 or the storage device(s) 79 (e.g., a fixed disk, such as a hard drive, or optical disk), as well as the exchange of information between subsystems.
  • The system memory 72 and/or the storage device(s) 79 may embody a computer readable medium.
  • Another subsystem is a data collection device 85, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
  • A computer system can include a plurality of the same components or subsystems, e.g., connected together by external interface 81, by an internal interface, or via removable storage devices that can be connected and removed from one component to another component.
  • Computer systems, subsystems, or apparatuses can communicate over a network.
  • One computer can be considered a client and another computer a server, where each can be part of a same computer system.
  • A client and a server can each include multiple systems, subsystems, or components.
  • Methods may involve various numbers of clients and/or servers, including at least 10, 20, 50, 100, 200, 500, 1,000, or 10,000 devices.
  • Methods can include various numbers of communication messages between devices, including at least 100, 200, 500, 1,000, 10,000, 50,000, 100,000, 500,000, or one million communication messages. Such communications can involve at least 1 MB, 10 MB, 100 MB, 1 GB, 10 GB, or 100 GB of data.
  • Aspects of embodiments can be implemented in the form of control logic using hardware (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software with a generally programmable processor in a modular or integrated manner.
  • A processor includes a single-core processor, a multi-core processor on a same integrated chip, or multiple processing units on a single circuit board or networked.
  • Any of the software components or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, C, C++, C#, Objective-C, Swift, or scripting language such as Perl or Python using, for example, conventional or object-oriented techniques.
  • The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission.
  • A suitable non-transitory computer readable medium can include random access memory (RAM), a read only memory (ROM), a magnetic medium such as a hard drive or a floppy disk, an optical medium such as a compact disk (CD) or DVD (digital versatile disk), flash memory, and the like.
  • The computer readable medium may be any combination of such storage or transmission devices.
  • Such programs may also be encoded and transmitted using carrier signals adapted for transmission via wired, optical, and/or wireless networks conforming to a variety of protocols, including the Internet.
  • A computer readable medium may be created using a data signal encoded with such programs.
  • Computer readable media encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., via Internet download). Any such computer readable medium may reside on or within a single computer product (e.g., a hard drive, a CD, or an entire computer system), and may be present on or within different computer products within a system or network.
  • A computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to a user.
  • Any of the methods described herein may be totally or partially performed with a computer system including one or more processors, which can be configured to perform the steps. Any operations performed with a processor (e.g., aligning, determining, comparing, computing, calculating) may be performed in real-time.
  • The term “real-time” may refer to computing operations or processes that are completed within a certain time constraint. The time constraint may be 1 minute, 1 hour, 1 day, or 7 days.
  • Embodiments can be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing a respective step or a respective group of steps. Although presented as numbered steps, steps of methods herein can be performed at a same time or at different times or in a different order.
  • Portions of these steps may be used with portions of other steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods can be performed with modules, units, circuits, or other means of a system for performing these steps.
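
The "real-time" definition above (an operation completed within a time constraint of 1 minute, 1 hour, 1 day, or 7 days) can be illustrated with a minimal Python sketch. The names `CONSTRAINTS_SECONDS` and `run_within` and the injectable `clock` parameter are hypothetical and chosen only for illustration; the description does not prescribe any particular implementation.

```python
import time

# Time constraints named in the description, expressed in seconds.
CONSTRAINTS_SECONDS = {
    "1 minute": 60,
    "1 hour": 60 * 60,
    "1 day": 24 * 60 * 60,
    "7 days": 7 * 24 * 60 * 60,
}


def run_within(operation, constraint, clock=time.monotonic):
    """Run `operation` and report whether it met the named time constraint.

    `clock` is injectable (an assumption made here) so the check can be
    exercised without actually waiting.
    Returns (result, met_constraint).
    """
    limit = CONSTRAINTS_SECONDS[constraint]
    start = clock()
    result = operation()
    elapsed = clock() - start
    return result, elapsed <= limit


# Example: a trivial stand-in for a processor operation such as aligning or comparing.
result, in_real_time = run_within(lambda: sum(range(1000)), "1 minute")
```

Under this reading, the same operation can count as real-time under one constraint and not under another; only the elapsed time relative to the chosen limit matters.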

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Pathology (AREA)
  • Analytical Chemistry (AREA)
  • Biomedical Technology (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Primary Health Care (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

Methods and systems are provided for analyzing cell-free RNA (cfRNA), comprising sequencing cfRNA from a biological sample from a subject to produce cfRNA sequence reads; and aligning and quantifying the cfRNA sequence reads using a repeat-aware transcriptome annotation or a repeat-aware genome annotation. Aligning and quantifying the cfRNA sequence reads can comprise aligning the cfRNA sequence reads to an annotated reference to obtain gene or repeat annotations for the cfRNA sequence reads, quantifying the cfRNA sequence reads that align to a gene or to a repeat element, and aggregating the repeat-aligned cfRNA sequence reads at the repeat subfamily level to generate repeat-aware cfRNA features for disease classification using machine learning.
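
The quantification and aggregation steps in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicants' pipeline: the record layout `(read_id, kind, name, subfamily)`, the function names `quantify` and `feature_vector`, the example loci, and the counts-per-million normalization are all hypothetical. Alignment itself is assumed to have been done upstream by a separate tool.

```python
from collections import Counter


def quantify(alignments):
    """Count reads per gene and per repeat subfamily.

    `alignments` is an iterable of (read_id, kind, name, subfamily) tuples,
    a hypothetical record layout; `kind` is "gene" or "repeat".
    """
    gene_counts = Counter()
    subfamily_counts = Counter()
    for _read_id, kind, name, subfamily in alignments:
        if kind == "gene":
            gene_counts[name] += 1
        elif kind == "repeat":
            # Collapse individual repeat loci onto their subfamily.
            subfamily_counts[subfamily] += 1
    return gene_counts, subfamily_counts


def feature_vector(gene_counts, subfamily_counts, feature_names):
    """Return counts per million for the requested features.

    Counts-per-million is an assumed normalization, and gene names and
    subfamily names are assumed not to collide.
    """
    total = sum(gene_counts.values()) + sum(subfamily_counts.values())
    if total == 0:
        return [0.0 for _ in feature_names]
    merged = {**gene_counts, **subfamily_counts}
    return [merged.get(name, 0) * 1e6 / total for name in feature_names]


# Toy input: one gene-aligned read and three repeat-aligned reads.
alignments = [
    ("r1", "gene", "ACTB", None),
    ("r2", "repeat", "AluSx_chr1_1000", "AluSx"),
    ("r3", "repeat", "AluSx_chr7_5000", "AluSx"),
    ("r4", "repeat", "L1PA2_chr3_200", "L1PA2"),
]
genes, subfamilies = quantify(alignments)
features = feature_vector(genes, subfamilies, ["ACTB", "AluSx", "L1PA2"])
```

One such feature vector per sample, paired with a disease label, is the kind of input a machine-learning classifier would be trained on. Aggregating to the subfamily level keeps the feature space small and sidesteps the ambiguity of assigning a read to one of many near-identical repeat loci.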
PCT/US2023/027043 2022-07-06 2023-07-06 Repeat-aware profiling of cell-free RNA WO2024010875A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263367766P 2022-07-06 2022-07-06
US63/367,766 2022-07-06

Publications (1)

Publication Number Publication Date
WO2024010875A1 (fr) 2024-01-11

Family

ID=89454059

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/027043 WO2024010875A1 (fr) 2022-07-06 2023-07-06 Repeat-aware profiling of cell-free RNA

Country Status (1)

Country Link
WO (1) WO2024010875A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118658533A (zh) * 2024-08-16 2024-09-17 Xi'an University of Technology A method for identifying piRNA-disease associations based on self-supervised learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140114582A1 (en) * 2012-10-18 2014-04-24 David A. Mittelman System and method for genotyping using informed error profiles
US20170121763A1 (en) * 2015-11-03 2017-05-04 Asuragen, Inc. Methods for nucleic acid size detection of repeat sequences
US20210098078A1 (en) * 2019-08-01 2021-04-01 Tempus Labs, Inc. Methods and systems for detecting microsatellite instability of a cancer in a liquid biopsy assay
US20210174958A1 (en) * 2018-04-13 2021-06-10 Freenome Holdings, Inc. Machine learning implementation for multi-analyte assay development and testing
WO2021110987A1 (fr) * 2019-12-06 2021-06-10 Life & Soft Methods and apparatuses for diagnosing cancer from cell-free nucleic acids

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
REGGIARDO ROMAN E., MAROLI SREELAKSHMI VELANDI, PEDDU VIKAS, DAVIDSON ANDREW E., HILL ALEXANDER, LAMONTAGNE ERIN, AARAJ YASSMIN AL: "Profiling of repetitive RNA sequences in the blood plasma of patients with cancer", NATURE BIOMEDICAL ENGINEERING, vol. 7, no. 12, pages 1627 - 1635, XP093128499, ISSN: 2157-846X, DOI: 10.1038/s41551-023-01081-7 *
SLOMOVIC SHIMYN, PORTNOY VICTORIA, SCHUSTER GADI: "Detection and Characterization of Polyadenylated RNA in Eukarya, Bacteria, Archaea, and Organelles", RNA TURNOVER IN BACTERIA, ARCHAEA AND ORGANELLES, vol. 447, 1 January 2008 (2008-01-01), pages 501 - 520, XP093128492, ISSN: 0076-6879, ISBN: 978-0-12-374377-0, DOI: 10.1016/S0076-6879(08)02224-6 *
YANG WAN R, ARDELJAN DANIEL, PACYNA CLARISSA N, PAYER LINDSAY M, BURNS KATHLEEN H: "SQuIRE reveals locus-specific regulation of interspersed repeat expression", NUCLEIC ACIDS RESEARCH, vol. 47, no. 5, 18 March 2019 (2019-03-18), GB , pages 1 - 16, XP093128482, ISSN: 0305-1048, DOI: 10.1093/nar/gky1301 *

Similar Documents

Publication Publication Date Title
JP7455757B2 Machine learning implementation for multi-analyte assay of biological samples
US20240079092A1 (en) Systems and methods for deriving and optimizing classifiers from multiple datasets
CN106103744B (zh) Devices, kits, and methods for predicting the onset of sepsis
EP3924502A1 (fr) An integrated machine-learning framework to estimate homologous recombination deficiency
US20150080243A1 (en) Methods and compositions for detecting cancer based on mirna expression profiles
US20240212848A1 (en) Systems and methods for determining whether a subject has a cancer condition using transfer learning
CN115667554A (zh) Methods and systems for detecting colorectal cancer via nucleic acid methylation analysis
US20210166813A1 (en) Systems and methods for evaluating longitudinal biological feature data
WO2018140256A1 (fr) Gene expression signatures useful to predict or diagnose sepsis and methods of using the same
US20230175058A1 (en) Methods and systems for abnormality detection in the patterns of nucleic acids
WO2013049152A2 (fr) Methods for evaluating lung cancer status
CN117413072A (zh) Methods and systems for detecting cancer via nucleic acid methylation analysis
WO2021061473A1 (fr) Systems and methods for diagnosing a disease condition using on-target and off-target sequencing data
JP2023524016A (ja) RNA markers and methods for identifying colon cell proliferative disorders
WO2024010875A1 (fr) Repeat-aware profiling of cell-free RNA
US20220101135A1 (en) Systems and methods for using a convolutional neural network to detect contamination
Emmert-Streib Statistical diagnostics for cancer: analyzing high-dimensional data
US20240363197A1 (en) Methods for characterizing infections and methods for developing tests for the same
WO2024192121A1 (fr) White blood cell contamination detection

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23836100

Country of ref document: EP

Kind code of ref document: A1