WO2020061643A1 - Expression profiling - Google Patents

Expression profiling

Info

Publication number
WO2020061643A1
WO2020061643A1 PCT/AU2019/051049 AU2019051049W WO2020061643A1 WO 2020061643 A1 WO2020061643 A1 WO 2020061643A1 AU 2019051049 W AU2019051049 W AU 2019051049W WO 2020061643 A1 WO2020061643 A1 WO 2020061643A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
sequences
expression profile
profile
profiles
Prior art date
Application number
PCT/AU2019/051049
Other languages
French (fr)
Inventor
Dennis BUNADI
Martin Smith
James Ferguson
Shaun CARSWELL
Original Assignee
Garvan Institute Of Medical Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from AU2018903657A
Application filed by Garvan Institute Of Medical Research
Publication of WO2020061643A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10 Gene or protein expression profiling; Expression-ratio estimation or normalisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N2800/00 Detection or diagnosis of diseases
    • G01N2800/26 Infectious diseases, e.g. generalised sepsis

Definitions

  • This disclosure relates to a method for determining a state of a biological sample using streaming data from a sequencer, such as, but not limited to, diagnosing sepsis using sequencing data.
  • transcriptome represents a snapshot of global genetic activity from a single cell or a population of cells (e.g. a tissue), which can be decomposed into thousands of individual genes and gene products that are each produced (or expressed) at different levels.
  • the nature and relative quantities of expressed genes are very dynamic and vary as a function of ‘cellular states’, e.g. tissue-specificity, developmental processes, differentiation, disease, drugs, and environment.
  • sequencing datasets are generally large so that an upload of the full dataset generally requires a long time, such as three days. For many diagnostic applications, especially emergency applications, this is unacceptably long.
  • a method for determining a state of a biological sample using streaming data from a sequencer comprises:
  • an expression profile for the sample comprising for each of the multiple sequences an indication of abundance of that sequence in the sample
  • Fig. 1 illustrates a sorted X-profile being generated using nanopore sequencing and a database of previously generated X-profiles against which the native X-profile is compared.
  • Fig. 2 illustrates an example of comparative X-profiles for determining tissue of origin.
  • Fig. 3 illustrates an example of an X-profile comparison approach.
  • Fig. 4 illustrates a comparison of an unknown sample to known samples.
  • Mouse RNAseq data from a blind sample (Sample X) was used to generate progressively larger X-profiles, which are compared to 3 reference X-profiles from known tissues (Brain, Kidney, Testes).
  • Sample X was predicted to be mouse brain, which was subsequently confirmed by the technician who produced the sample.
  • Fig. 5 illustrates a method for diagnosis of sepsis in a sample from a patient.
  • Fig. 6 illustrates a method for determining a state of a biological sample.
  • Nanopore sequencing enables real-time analysis of genomic and transcriptomic data. In particular, the real-time acquisition of data enables interactive, selective sequencing applications premised on instantaneous analysis of sequencing data.
  • a molecule can be ejected by reversing the flow of current across the nanopore if the analysis of the sequence reveals it to be undesired. Conversely, the molecule may continue to be sequenced if analysis of the sequence reveals it to be desirable.
  • Oxford Nanopore Technologies have pioneered such applications with their ‘read-until’ functionality.
  • for RNA sequencing (a.k.a. transcriptomics) it can be beneficial to selectively reject abundant and highly similar transcripts, such as mRNA sequences of the same genes. Indeed, some highly expressed genes compose the majority of mRNA sequences in a transcriptome. These abundant molecules can saturate a sequencing experiment, and provide little qualitative information after an initial subset of sequencing reads has been generated. It is thus desirable to reject these reads once they have been sequenced sufficiently to determine the composition and diversity of their primary structure.
  • less abundant transcripts, such as regulatory ncRNAs can provide distinguishing information about the nature of a sample.
  • retaining the relative abundances of all transcripts can nonetheless provide distinguishing information about the nature of the sample.
  • This disclosure provides a method to characterize cellular states by generating qualitative and quantitative expression profiles (X-profiles) using a data format compatible with real-time nanopore sequencing.
  • X-profiles for processing transcriptomic data in real-time, including the comparative analysis of X-profiles.
  • comparative X-profile analysis can be used to identify the source of an unknown RNA sequencing sample by comparing it to a database of annotated X-profiles.
  • This approach can be extended to clinical applications, such as the identification of tissue of origin for metastatic cancers of unknown primary (CUPs), or the stratification of sepsis patients based on signatures of gene expression (i.e.
  • X-profiles enable real-time comparisons to other X-profiles generated a priori, enabling real-time classification of biological and clinical samples, which can drastically reduce the turnaround time for clinical tests.
  • An “expression profile” is a database that stores biological sequencing information in signal form, alongside a quantification of said signal abundance as described in PCT/AU2018/050265, which is incorporated herein by reference.
  • An X-profile can be sorted by the relative abundance (i.e. quantification of signal), from most common to least common.
  • Collections of expression profiles for disparate tissue / sample types may be loaded into cloud-computing instances, allowing comparisons between expression profiles to determine match similarity via rank correlation.
  • a processor of a computing system receives multiple sequences of a sample from the sequencer, such as in the form of a file generated by the sequencer.
  • Each sequence can be considered as being a ‘read’, that is, one contiguous stream of sequencing data, noting that for nanopore sequencing the reads are relatively long compared to Illumina sequencing, for example.
  • the processor then generates an expression profile for the sample.
  • the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample.
  • Fig. 1 illustrates an expression profile (X-profile) 101, which is sorted in this example.
  • the solid bars in each row of profile 101 indicate the abundance of that sequence in the sense that longer bars indicate a higher number of sequences being read.
  • the processor has generated the profile 101 using nanopore sequencing. It is noted that at the moment in time of Fig. 1, the profile 101 is not yet complete but rather a ‘work in progress’: the processor is still building the profile 101 because the entire sequencing data has not yet been received. In this sense, profile 101 could be referred to as partial, incomplete, fragmentary or unfinished. Nevertheless, the processor can already use the partial or intermediate profile 101 as described below.
  • the processor receives further sequences 106 as streaming data from the sequencer as shown at the left hand side of Fig. 1. While the processor receives the further sequences 106, the processor performs the steps below. This means that the processor may perform the below steps during the sequencing, as the signal or the individual bases arrive at the processor, or at the end of each read where the profile 101 is updated or after every 10 or 100 reads. Importantly, the processor performs the below steps multiple times before the entire sequencing data is available.
  • the steps repeated by the processor include updating the expression profile 101 for the sample, so that the stored abundances reflect the number of reads received so far for each stored read.
  • the processor then performs a comparison of the expression profile 101 for the sample to a stored expression profile (103, 104, 105), noting that the stored profiles 103, 104, 105 are associated with a respective predefined state of the sample. For example, the profile may be indicative of an abundance of sequences when sepsis is present.
  • the processor determines the state of the sample as the state associated with the matching stored expression profile. For example, when the stored sepsis profile matches with the current profile 101, the processor determines that the patient has sepsis. Importantly, upon determining the state of the sample (i.e. sepsis is present), the processor terminates the receiving of the further sequences.
  • the database can be reduced to only retain
  • X-profiles can be extended with other features arising from the signal that can feed into a maximum likelihood model or classifier system, including but not limited to transformations of the signal from the time domain to the frequency domain, signal time series averages, peak co-ordinates, auto-correlates, zero-crossing derivative vectors, etc. (see Fig. 2). While Fig. 2 provides some examples of features (events, FFT, PSD, Matched signal abundance), a combination of those or others not mentioned here may equally be used. In one example, the method uses a model for each tissue of interest, or biological data signatures in k-mer space.
  • X-profiles can be generated using different sequencing technologies and can be converted between formats.
  • Public RNA sequencing datasets using the Illumina short read platform are plentiful in repositories such as TCGA, GTEx, MiTranscriptome etc.
  • An example of how they can be used to generate X-profiles follows:
  • X-profiles can also be converted between formats, sequencing technologies, platforms, or data sets, enabling the generation of a normalized, unified and centralized database of gene expression profiles.
  • an X-profile generated with sequence information as the qualitative feature can be converted to signal features using a tool like Scrappie or DeepSimulator, which convert between sequence and nanopore signal data, in this example.
  • the abundances from the original profiles can thus be interchangeable across datasets of different qualitative natures, facilitating normalization across different sequencing platforms.
  • One or more X-profiles can be used to generate a representative X-profile for a given sample, tissue, biological or physical feature of interest. For example, two or more X-profiles can be merged by creating a meta X-profile that represents a consensus of the two or more profiles. Similarly, two or more X-profiles can be merged by extracting the common or discriminative profiles.
  • the method subtracts the mean and divides by the standard deviation of the residuals, so that like is compared with like.
  • query X-profiles are normalized against reference X-profiles. This can, for example, be done by subtracting the mean and dividing by the standard deviation of the residuals, or as another example, map the bounds between
  • Fig. 3 illustrates how the processor compares two expression profiles 301 and 302.
  • the processor takes two expression profiles - A 101 and B 103. Each profile is ordered by descending abundance.
  • the processor then takes the first signal in A 101 and compares it to the first signal in B by applying a signal comparison function as indicated by the arrows in Fig. 3. If the very first signals match, it can be said that A rank 1 matches to B rank 1, resulting in a score of 1. If they do not match, the processor continues comparing for A’s next N neighbors in B (if no match, then N+1 rank scoring penalty).
  • the first signal in A 101 matches to the sixth signal in B 103, which results in a score of 6.
  • the second signal in A 101 matches with the fifth signal in B 103 resulting in a score of 5 and for the third signal in A a score of 3.
  • First (top) the most abundant sequence/signal from X-profile A is compared to the most abundant seq/signal from X-profile B. The rank of a ‘match’ is returned. Same for the 2nd (middle) and 3rd most abundant signal (bottom) from X-profile A.
  • a less similar X-profile C would produce an ABscore >>14, while a more similar one would produce a score <<14.
  • the result is a vector of rank-matches between A & B - A has a natural vector (just the indices ordered by abundance), while we’ve returned the vector of B in relation to A.
  • the stored signal data can be obtained directly from a sequencing machine (e.g. Oxford Nanopore devices such as MinION, GridION, PromethION, etc.) or indirectly by taking sequence data in basespace, such as generated by short read sequencing (Illumina), or from transcriptome annotations generated from de novo assembly of data, or cDNA sequencing using other technologies, and converting the nucleotide sequence into a similar ‘squiggle’ signal format, with tools like
  • the model can be included with the SQUID DB for different samples / tissues, so that we can extract features from newly sequenced signals and classify them according to our trained models.
  • 93/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa) is used as a database entry (e.g. the first column/qualitative feature of the X-profile examples above);
  • a fourth X-profile (sample X) was then generated using increasing amount of reads.
  • a first X-profile was generated as described above with the first 1000 base-called reads from sample X (Xp-1k), then compared to samples B, K, & T using a rank sum correlation. The respective values are plotted in Figure 4.
  • a second X-profile (Xp-10k) was then generated by sampling a further 9000 base-called reads from sample X (10,000 total abundance) and adding them to Xp-1k.
  • Xp-10k was then compared to the 3 X-profiles from known samples as previously described, and plotted in Figure 4.
  • Sample X can rapidly be classified as Sample B, or brain tissue, by comparing the relative similarity scores (here, the rank sum correlation) across reference X-profiles.
  • a final X-profile (Xp-F) including all base called sequences from sample X was compared to the 3 X-profiles from known samples, generating a match to sample B (brain) with a P-value of 0.02 (Tau test, t ⁇ 0.1). This result was found to be discriminatory, as matches to the X-Profiles of the other tissues did not result in a significantly correlated ranking (t ⁇ 0.1, P-values > 0.65).
  • Sequence patients with and without sepsis to generate X-profiles labelled for clinical data such as severity of infection, nature of pathogen, source of infection, patient age, health outcomes, demographics, date;
  • any other tumour can be compared to previously sequenced tumours to find a match.
  • the profiles are formatted such that they are compatible with a real-time processing of the sequencing data stream. That is, the sequencing signal is received and while the sequencing signal is being received (before the full data is available), a diagnosis can be made by the proposed method.
  • the indication of abundances in the profiles is continuously updated and after every update or periodically (such as every minute or every 5 minutes) the profile is matched against the stored profiles.
  • one of the stored profiles may be the typical profile of a sepsis patient and a good match indicates sepsis as a diagnosis and treatment can be commenced straight away and within a short time window, such as within 10 minutes or within 30 minutes. This also means that the receiving of the sequencing data can be stopped before the full data has been received and as soon as a diagnosis has been provided.
  • the data stream is processed in real time, while the stream is being generated.
  • a whole genome sequencing such as Illumina sequencing may be performed off-site but the dataset is too large to transmit via a relatively slow internet connection. For example, it may take three days to transmit the entire dataset which is too long for some diagnoses, such as sepsis.
  • the sequences are ordered by abundance and the matching score represents the difference in the position of the sequence within the ordered sequences, because the most abundant sequences are likely to be sequenced at larger numbers early and therefore provide a robust diagnosis.
  • the diagnosis is performed based on the most abundant (i.e. most accurate) sequences.
  • the comparison between profiles is not performed on all available sequences but only on the top most abundant sequences (such as top 10 or top 100 sequences).
  • the analysis (i.e. receiving of further sequences) is stopped as soon as the threshold is met. For example, where a higher matching score indicates a worse match, the analysis is stopped as soon as the matching score is below the threshold (such as 100 in the example of Fig. 3).
  • while the sequences may comprise base calls, it is also possible that they comprise a time domain electrical signal, also referred to as a squiggle, which may be indicative of the current through a nanopore while the bases pass through the nanopore.
  • the advantage of using squiggles is that it is not necessary to call bases from the squiggle (i.e. convert the squiggle into sequence), which speeds up the process and increases reliability as approximations are removed. It is also possible to use BLAST or minimap2 for sequence matching instead of DWT for squiggle matching.
  • the method described herein is performed by a computer system comprising an input port to receive the sequences (such as USB) and a processor to create/update the expression profiles and to compare the expression profile against the database.
  • the database may be local or remote and the comparison (i.e. calculating a matching score) may be performed remotely, such as in a cloud computing
  • the bandwidth required for the cloud computing implementation is minimal because it is not necessary to upload the entire sequencing data set at once but only as it is generated by the sequencer. In that case, the library of expression profiles would also be stored in the cloud and matched there. This allows the use of relatively large libraries without the need for local data storage and without the need for full transfer of the entire sequencing data set as an upload from the sequencer. This has the significant technical advantage that the analysis of the sequencing data can be performed much faster because it is not necessary to wait for the upload to finish.
  • Fig. 5 illustrates a method 500 for diagnosis of sepsis in a sample from a patient using streaming data from a sequencer.
  • the method comprises receiving 501 multiple sequences of the sample from the sequencer and generating 502 an expression profile for the sample.
  • the expression profile comprises for each of the multiple sequences an indication of abundance of that sequence in the sample.
  • Method 500 also comprises receiving 503 further sequences as streaming data from the sequencer and while receiving 504 the further sequences, the method 500 comprises performing the steps of:
  • Fig. 6 illustrates method 600 for determining a state of a biological sample using streaming data from a sequencer.
  • Method 600 comprises receiving 601 multiple sequences of the sample from the sequencer and generating 602 an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample.
  • Method 600 further comprises receiving further sequences as streaming data from the sequencer and while receiving 604 the further sequences performing the steps of:
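The rank-matching comparison described above for Fig. 3 can be illustrated with a short sketch. It is only one possible reading of that description: the function name `rank_matches`, the use of plain equality as the signal comparison function, and the signal labels are assumptions made for illustration. The example reproduces the scores quoted above (6, 5 and 3, giving a total of 14).

```python
from typing import Callable, List, Optional, Sequence


def rank_matches(
    a: Sequence[str],
    b: Sequence[str],
    matches: Callable[[str, str], bool] = lambda x, y: x == y,
    n: Optional[int] = None,
) -> List[int]:
    """For each entry of A (ordered by abundance), return the 1-based rank
    of its first match in B; entries with no match in the first n ranks of
    B receive the penalty score n + 1."""
    n = len(b) if n is None else n
    window = list(b[:n])
    scores = []
    for signal in a:
        rank = next(
            (j for j, other in enumerate(window, start=1) if matches(signal, other)),
            n + 1,  # penalty when no match is found
        )
        scores.append(rank)
    return scores


# Hypothetical profiles, both already ordered by descending abundance.
profile_a = ["s1", "s2", "s3"]
profile_b = ["x", "y", "s3", "z", "s2", "s1"]

scores = rank_matches(profile_a, profile_b)
ab_score = sum(scores)
print(scores, ab_score)
```

In practice the equality test would be replaced by whatever signal or sequence comparison function is used, and a lower total indicates a closer match between the two profiles.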

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Genetics & Genomics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Zoology (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Epidemiology (AREA)
  • Analytical Chemistry (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Wood Science & Technology (AREA)
  • Bioethics (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Signal Processing (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

This disclosure relates to a method for determining a state of a biological sample using streaming data from a sequencer, such as, but not limited to, diagnosing sepsis using sequencing data. A processor generates an expression profile for the sample. The expression profile comprises for each of the multiple sequences an indication of abundance of that sequence in the sample. While the processor receives further sequences for the sample, the processor updates the expression profile for the sample, performs a comparison of the expression profile for the sample to stored expression profiles to determine a matching stored expression profile, and determines the state of the sample as the state associated with the matching stored expression profile (such as sepsis). Upon determining the state of the sample, the processor terminates the receiving of the further sequences before the full sequencing data has been received.

Description

EXPRESSION PROFILING
Related application
[0001] This application claims priority from Australian application 2018903657, filed on 27 September 2018, which is incorporated herein by reference.
Technical Field
[0002] This disclosure relates to a method for determining a state of a biological sample using streaming data from a sequencer, such as, but not limited to, diagnosing sepsis using sequencing data.
Background
[0003] The genome produces a diverse multitude of protein-coding (mRNA) and non-protein-coding (ncRNA) transcripts that, collectively, embody the transcriptome. A transcriptome represents a snapshot of global genetic activity from a single cell or a population of cells (e.g. a tissue), which can be decomposed into thousands of individual genes and gene products that are each produced (or expressed) at different levels. The nature and relative quantities of expressed genes are very dynamic and vary as a function of ‘cellular states’, e.g. tissue-specificity, developmental processes, differentiation, disease, drugs, and environment. Hence, measuring and observing transcriptomes via high-throughput sequencing provides an informative, high-resolution molecular profile (or ‘snapshot’) of cellular states.
[0004] However, sequencing datasets are generally large so that an upload of the full dataset generally requires a long time, such as three days. For many diagnostic applications, especially emergency applications, this is unacceptably long.
[0005] Any discussion of documents, acts, materials, devices, articles or the like which has been included in the present specification is not to be taken as an admission that any or all of these matters form part of the prior art base or were common general knowledge in the field relevant to the present disclosure as it existed before the priority date of each claim of this application.
[0006] Throughout this specification the word "comprise", or variations such as "comprises" or "comprising", will be understood to imply the inclusion of a stated element, integer or step, or group of elements, integers or steps, but not the exclusion of any other element, integer or step, or group of elements, integers or steps.
Summary
[0007] Disclosed herein is a method for analysing sequences by matching the abundances (i.e. expression levels) against known profiles. This is achieved without the entire sequencing data-set but on the fly as the sequences become available. Once a match is found, the process can be stopped, which results in a significantly reduced time required to come to a decision.
[0008] In this sense, there is provided a method for determining a state of a biological sample using streaming data from a sequencer. The method comprises:
receiving multiple sequences of the sample from the sequencer;
generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;
receiving further sequences as streaming data from the sequencer;
while receiving the further sequences performing the steps of:
updating the expression profile for the sample;
performing a comparison of the expression profile for the sample to one or more stored expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample;
determining the state of the sample as the state associated with the matching stored expression profile; and
upon determining the state of the sample terminating the receiving of the further sequences.
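The steps listed above can be sketched as a simple streaming loop. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the names `classify_stream` and `l1_distance`, the parameters `initial` and `check_every`, the L1 distance between relative abundances used as the comparison, and the toy profiles are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

Profile = Dict[str, int]  # sequence identifier -> abundance (read count)


def l1_distance(a: Profile, b: Profile) -> float:
    """Hypothetical comparison: L1 distance between relative abundances."""
    total_a, total_b = sum(a.values()), sum(b.values())
    return sum(
        abs(a.get(k, 0) / total_a - b.get(k, 0) / total_b) for k in set(a) | set(b)
    )


def classify_stream(
    reads: Iterable[str],
    stored: Dict[str, Profile],
    score: Callable[[Profile, Profile], float],
    threshold: float,
    initial: int = 5,
    check_every: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (state, reads consumed); state is None if nothing matched."""
    profile: Counter = Counter()
    consumed = 0
    for read in reads:
        profile[read] += 1  # generate, then update, the expression profile
        consumed += 1
        if consumed < initial or consumed % check_every:
            continue  # only compare periodically
        # Compare the partial profile to each stored profile (lower is better).
        state, best = min(
            ((name, score(profile, ref)) for name, ref in stored.items()),
            key=lambda item: item[1],
        )
        if best <= threshold:
            return state, consumed  # stop receiving further sequences
    return None, consumed


# Toy stored profiles and a toy read stream (illustrative values only).
stored_profiles = {
    "sepsis": {"g1": 6, "g2": 3, "g3": 1},
    "healthy": {"g1": 1, "g2": 3, "g3": 6},
}
stream = ["g1", "g1", "g2", "g1", "g1", "g2", "g1", "g3", "g1", "g2"] * 10

state, consumed = classify_stream(stream, stored_profiles, l1_distance, threshold=0.3)
print(state, consumed)
```

In this toy run the loop returns after the second comparison, long before the 100 available reads are exhausted, which is the point of terminating on a match.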
[0009] There is also provided a method for determining a state of a biological sample using streaming data from a sequencer. The method comprises:
receiving multiple sequences of the sample from the sequencer;
generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;
receiving further sequences as streaming data from the sequencer;
while receiving the further sequences performing the steps of:
updating the expression profile for the sample;
ordering the sequences in the expression profile for the sample by the respective abundances;
performing a comparison of the expression profile for the sample to one or more stored expression profiles based on a difference in a position of a sequence within the ordered sequences between the expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample and ordered by the respective abundances;
determining the state of the sample as the state associated with the matching stored expression profile; and
upon determining the state of the sample terminating the receiving of the further sequences.
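A comparison based on position differences within abundance-ordered profiles could look like the following sketch. The claim language above does not fix a formula, so the sum of absolute rank differences used here (sometimes called Spearman's footrule) is an assumption, as are the function names, the `top_n` parameter, the penalty for absent sequences, and the toy profiles.

```python
from typing import Dict, List


def ranked(profile: Dict[str, int]) -> List[str]:
    """Sequences ordered by descending abundance (ties broken by name)."""
    return sorted(profile, key=lambda s: (-profile[s], s))


def position_difference(
    sample: Dict[str, int], stored: Dict[str, int], top_n: int = 100
) -> int:
    """Sum of absolute differences between each sequence's position in the
    sample profile and its position in the stored profile (lower is better)."""
    order_sample = ranked(sample)[:top_n]  # only the most abundant sequences
    order_stored = ranked(stored)
    position = {s: i for i, s in enumerate(order_stored)}
    missing = len(order_stored)  # position assigned to absent sequences
    return sum(abs(i - position.get(s, missing)) for i, s in enumerate(order_sample))


# Toy profiles (illustrative values only).
sample_x = {"a": 50, "b": 30, "c": 20}
references = {
    "brain": {"a": 500, "b": 320, "c": 100},
    "kidney": {"c": 400, "b": 300, "a": 50},
}

distances = {name: position_difference(sample_x, ref) for name, ref in references.items()}
best_match = min(distances, key=distances.get)
print(distances, best_match)
```

Because only ranks are compared, the sample and stored profiles do not need the same total read count, which suits a partial profile that is still being built.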
Brief Description of Drawings
[0010] An example will now be provided with reference to the following drawings:
[0011] Fig. 1 illustrates a sorted X-profile being generated using nanopore sequencing and a database of previously generated X-profiles against which the native X-profile is compared.
[0012] Fig. 2 illustrates an example of comparative X-profiles for determining tissue of origin.
[0013] Fig. 3 illustrates an example of X-profile comparison approach.
[0014] Fig. 4 illustrates a comparison of an unknown sample to known samples. Mouse RNAseq data from a blind sample (Sample X) was used to generate progressively larger X-profiles, which are compared to 3 reference X-profiles from known tissues (Brain, Kidney, Testes). Sample X was predicted to be mouse brain, which was subsequently confirmed by the technician who produced the sample.
[0015] Fig. 5 illustrates a method for diagnosis of sepsis in a sample from a patient.
[0016] Fig. 6 illustrates a method for determining a state of a biological sample.
Description of Embodiments
[0017] Nanopore sequencing enables real-time analysis of genomic and transcriptomic data. In particular, the real-time acquisition of data enables interactive, selective sequencing applications premised on instantaneous analysis of sequencing data. A molecule can be ejected by reversing the flow of current across the nanopore if the analysis of the sequence reveals it to be undesired. Conversely, the molecule may continue to be sequenced if analysis of the sequence reveals it to be desirable. Oxford Nanopore Technologies have pioneered such applications with their ‘read-until’ functionality.
[0018] For RNA sequencing (a.k.a. transcriptomics) it can be beneficial to selectively reject abundant and highly similar transcripts, such as mRNA sequences of the same genes. Indeed, some highly-expressed genes compose the majority of mRNA sequences in a transcriptome. These abundant molecules can saturate a sequencing experiment, and provide little qualitative information after an initial subset of sequencing reads have been generated. It is thus desirable to reject these reads once they have been sequenced sufficiently to determine the composition and diversity of their primary structure. Indeed, less abundant transcripts, such as regulatory ncRNAs, can provide distinguishing information about the nature of a sample. However, retaining the relative abundances of all transcripts can nonetheless provide distinguishing information about the nature of the sample.
[0019] This disclosure provides a method to characterize cellular states by generating qualitative and quantitative expression profiles (X-profiles) using a data format compatible with real-time nanopore sequencing. We describe the utility of X-profiles for processing transcriptomic data in real-time, including the comparative analysis of X-profiles. We demonstrate how comparative X-profile analysis can be used to identify the source of an unknown RNA sequencing sample by comparing it to a database of annotated X-profiles. This approach can be extended to clinical applications, such as the identification of tissue of origin for metastatic cancers of unknown primary (CUPs), or the stratification of sepsis patients based on signatures of gene expression (i.e. ‘cellular states’). Furthermore, the nature of X-profiles enables real-time comparisons to other X-profiles generated a priori, enabling real-time classification of biological and clinical samples, which can drastically reduce the turnaround time for clinical tests.
[0020] An “expression profile” (X-profile) is a database that stores biological sequencing information in signal form, alongside a quantification of said signal abundance as described in PCT/AU2018/050265, which is incorporated herein by reference. An X-profile can be sorted by the relative abundance (i.e. quantification of signal), from most common to least common (Fig. 1). Collections of expression profiles for disparate tissue / sample types may be loaded into cloud-computing instances, allowing comparisons between expression profiles to determine match similarity via rank correlation. A processor of a computing system receives multiple sequences of a sample from the sequencer, such as in the form of a file generated by the sequencer. Each sequence can be considered as being a ‘read’, that is, one contiguous stream of sequencing data, noting that for nanopore sequencing the reads are relatively long compared to Illumina sequencing, for example. The processor then generates an expression profile for the sample. The expression profile comprises, for each of the multiple sequences, an indication of abundance of that sequence in the sample.
[0021] Fig. 1 illustrates an expression profile (X-profile) 101, which is sorted in this example. The solid bars in each row of profile 101 indicate the abundance of that sequence in the sense that longer bars indicate a higher number of sequences being read. In this example, the processor has generated the profile 101 using nanopore sequencing. It is noted that at the moment in time of Fig. 1, the profile 101 is not yet complete but rather a ‘work in progress’: the processor is still building the profile 101 because the entire sequencing data has not yet been received. In this sense, profile 101 could be referred to as partial, incomplete, fragmentary or unfinished. Nevertheless, the processor can already use the partial or intermediate profile 101 as described below. There is also a database 102 of previously generated X-profiles 103, 104, 105 against which the native X-profile is compared.
[0022] In this sense, the processor receives further sequences 106 as streaming data from the sequencer as shown at the left hand side of Fig. 1. While the processor receives the further sequences 106, the processor performs the steps below. This means that the processor may perform the steps below during the sequencing, as the signal or the individual bases arrive at the processor, or at the end of each read where the profile 101 is updated, or after every 10 or 100 reads. Importantly, the processor performs the steps below multiple times before the entire sequencing data is available.
[0023] The steps repeated by the processor include updating the expression profile 101 for the sample, so that the stored abundances reflect the number of reads received so far for each stored read. The processor then performs a comparison of the expression profile 101 for the sample to a stored expression profile (103, 104, 105), noting that the stored profiles 103, 104, 105 are associated with a respective predefined state of the sample. For example, the profile may be indicative of an abundance of sequences when sepsis is present. The processor then determines the state of the sample as the state associated with the matching stored expression profile. For example, when the stored sepsis profile matches with the current profile 101, the processor determines that the patient has sepsis. Importantly, upon determining the state of the sample (i.e. sepsis is present), the processor terminates the receiving of the further sequences.
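The repeated update, compare and terminate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the names `classify_stream`, `rank_displacement`, `batch_size`, `top_n` and `threshold` are hypothetical, each incoming read is assumed to have already been resolved to the identifier of a profile entry, and the rank-displacement score is just one simple way of comparing two abundance-ordered profiles.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def rank_displacement(sample: List[str], stored: List[str], top_n: int) -> float:
    """Mean rank displacement of the sample's top_n entries within a stored profile.

    Lower is better. An entry absent from the stored profile is penalised
    with len(stored). This score is an illustrative choice only.
    """
    stored_rank = {entry: rank for rank, entry in enumerate(stored)}
    top = sample[:top_n]
    if not top:
        return float("inf")
    total = 0
    for rank, entry in enumerate(top):
        other = stored_rank.get(entry)
        total += len(stored) if other is None else abs(rank - other)
    return total / len(top)


def classify_stream(
    reads: Iterable[str],
    references: Dict[str, List[str]],
    batch_size: int = 100,
    top_n: int = 10,
    threshold: float = 0.5,
) -> Tuple[Optional[str], int]:
    """Consume reads until one stored profile matches; return (state, reads used)."""
    counts: Counter = Counter()
    used = 0
    for entry in reads:
        counts[entry] += 1  # update the expression profile
        used += 1
        if used % batch_size:
            continue  # compare only once per batch of reads
        ordered = [e for e, _ in counts.most_common()]  # order by abundance
        scores = {
            state: rank_displacement(ordered, profile, top_n)
            for state, profile in references.items()
        }
        best = min(scores, key=scores.get)
        if scores[best] <= threshold:
            return best, used  # stop receiving further sequences
    return None, used  # stream ended without a sufficiently good match
```

Because `reads` can be any iterator, returning early leaves the remainder of the stream unread, which mirrors terminating the receiving of further sequences once a state has been determined.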
[0024] Furthermore, the database can be reduced to only retain features/sequences/entries of noted significance or interest. Conversely, X-profiles can be extended with other features arising from the signal that can feed into a maximum likelihood model or classifier system, including but not limited to transformations of the signal from the time domain to the frequency domain, signal time series averages, peak co-ordinates, auto-correlates, zero-crossing derivative vectors, etc. (see Fig. 2). While Fig. 2 provides some examples of features (events, FFT, PSD, Matched signal abundance), a combination of those or others not mentioned here may equally be used. In one example, the method uses a model for each tissue of interest, or biological data signatures in k-mer space.
[0025] X-profiles can be generated using different sequencing technologies and can be converted between formats. Public RNA sequencing datasets using the Illumina short read platform are plentiful in repositories such as TCGA, GTEx, MiTranscriptome, etc. An example of how they can be used to generate X-profiles follows:
1. Generate reference transcripts to be used as qualitative X-profile features from ab initio transcriptome assembly of short reads using tools such as Trinity, Trans-ABySS, SeqMan NGen, SOAPdenovo-Trans, Velvet/Oases, etc. Comparably, this can be done with de novo assembly tools such as Cufflinks, StringTie, etc.
2. Quantify the assembly with tools like Kallisto, Salmon, Sailfish or HTSeq-count, generating abundances for the assembled sequences;
3. Sort the sequences by decreasing abundance.
Another example:
1. Extract sequences corresponding to CAGEseq peaks from FANTOM5, representing the 5’ end of mRNAs, as the qualitative feature of the X-profile;
2. Assign the CAGE peak abundance as the quantitative feature of the extracted mRNA sequence in the X-profile;
3. Sort by decreasing abundance.
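Steps 2 and 3 of the first example above (quantify, then sort by decreasing abundance) can be sketched as follows. The sketch assumes a tab-separated quantification table with a header row; the default column names `target_id` and `tpm` are modelled on the layout of a Kallisto `abundance.tsv` file, and the function name `xprofile_from_quant` is hypothetical. Other quantifiers use different column names, which can be passed in.

```python
import csv
import io
from typing import List, Tuple


def xprofile_from_quant(
    tsv_text: str,
    id_col: str = "target_id",
    abundance_col: str = "tpm",
) -> List[Tuple[str, float]]:
    """Turn a quantification table into (entry, abundance) pairs, most abundant first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row[id_col], float(row[abundance_col])) for row in reader]
    rows.sort(key=lambda item: item[1], reverse=True)  # decreasing abundance
    return rows
```

The returned list keeps the qualitative feature (the transcript identifier) next to its quantitative feature (the abundance), which is the pairing an X-profile stores.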
[0026] X-profiles can also be converted between formats, sequencing technologies, platforms, or data sets, enabling the generation of a normalized, unified and centralized database of gene expression profiles. For example, an X-profile generated with sequence information as the qualitative feature can be converted to signal features using a tool like Scrappie or DeepSimulator, which convert between sequence and nanopore signal data, in this example. The abundances from the original profiles can thus be interchangeable across datasets of different qualitative natures, facilitating normalization across different sequencing platforms.
[0027] One or more X-profiles can be used to generate a representative X-profile for a given sample, tissue, biological or physical feature of interest. For example, two or more X-profiles can be merged by creating a meta X-profile that represents a consensus of the two or more profiles. Similarly, two or more X-profiles can be merged by extracting the common or discriminative profiles.
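One possible way to build such a meta X-profile is sketched below. The disclosure does not specify how the consensus is computed, so everything here is an assumption made for illustration: the function name `consensus_profile`, the `min_support` parameter, the use of the mean abundance, and the premise that the input profiles have already been normalized to a common scale.

```python
from typing import Dict, List, Optional, Tuple


def consensus_profile(
    profiles: List[Dict[str, float]],
    min_support: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Merge profiles into one abundance-sorted consensus.

    An entry is kept if it occurs in at least min_support profiles
    (default: all of them); its abundance is the mean over those profiles.
    """
    if not profiles:
        return []
    if min_support is None:
        min_support = len(profiles)
    seen: Dict[str, List[float]] = {}
    for profile in profiles:
        for entry, abundance in profile.items():
            seen.setdefault(entry, []).append(abundance)
    merged = [
        (entry, sum(values) / len(values))
        for entry, values in seen.items()
        if len(values) >= min_support
    ]
    merged.sort(key=lambda item: item[1], reverse=True)  # most abundant first
    return merged
```

With the default `min_support`, only entries common to every input profile survive; lowering it keeps entries that are specific to a subset of the profiles.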
Comparing expression profiles
Normalize signals across samples
[0028] To compare X-profiles, the query X-profile is normalized against the reference X-profiles so that like is compared with like. This can, for example, be done by subtracting the mean and dividing by the standard deviation of the residuals or, as another example, by mapping the bounds to [0, 1].
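The two normalizations mentioned above can be sketched as follows. The function names `zscore` and `minmax` are hypothetical, and the handling of a constant signal (returning zeros) is an assumption added to avoid division by zero.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Subtract the mean and divide by the standard deviation of the residuals."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of (value - mu)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def minmax(values: Sequence[float]) -> List[float]:
    """Map the smallest value to 0 and the largest value to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applying the same function to the query and to every reference puts all signals on a common scale before they are compared.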
Nearest neighbor rank correlation of profiles
[0029] The table below provides an example where each row from profile 101 is annotated with the best matching row from profile 102 with the corresponding rank. The number values in the table below do not directly correspond to the example in Fig. 1 but constitute a different example.
[Table supplied as image imgf000011_0001; not reproduced in this text.]
[0030] Fig. 3 illustrates how the processor compares two expression profiles 301 and 302. The processor takes two expression profiles, A 101 and B 103. Each profile is ordered by descending abundance. The processor then takes the first signal in A 101 and compares it to the first signal in B by applying a signal comparison function as indicated by the arrows in Fig. 3. If the very first signals match, it can be said that A rank 1 matches B rank 1, resulting in a score of 1. If they do not match, the processor continues comparing against A’s next N neighbors in B (if there is no match, an N+1 rank scoring penalty applies).
[0031] In the example of Fig. 3, the first signal in A 101 matches the sixth signal in B 103, which results in a score of 6. The second signal in A 101 matches the fifth signal in B 103, resulting in a score of 5, and the third signal in A yields a score of 3.
[0032] First (top), the most abundant sequence/signal from X-profile A is compared to the most abundant sequence/signal from X-profile B. The rank of a ‘match’ is returned. The same is done for the 2nd (middle) and 3rd (bottom) most abundant signals from X-profile A. In Fig. 3, the AB rank sum for this example (top 3 from profile A) would be 6+5+3=14. A less similar X-profile C would produce an AB score >>14, while a more similar one would produce a score <14.
[0033] Pseudo-code:
for read1, read2 in product(reads1, reads2):
    # [part of the pseudo-code was supplied as image imgf000012_0001; not reproduced]
    distance = mlpy.dtw_std(reads1[read1], reads2[read2], dist_only=True)
    if distance within THRESH and not exceed N_TRIES:
        # signals match
        read2_idx.append(reads1.index(read))
        break
# in the end have two X ranks for read1, read2; can rank correlate / rank sum etc.
[0034] The result is a vector of rank-matches between A & B - A has a natural vector (just the indices ordered by abundance), while we’ve returned the vector of B in relation to A.
[0035] It is then possible to apply a rank correlation coefficient (Kendall Tau, etc.) to assess the ordinal association between the profiles, e.g. whether the transcript abundances of these signals rank together, as measured by tau and its p-value.
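The rank-match vector and its rank correlation can be sketched in runnable form as follows. This is an interpretation under stated assumptions rather than the authors' code: `rank_matches` searches only the first `n_neighbors` entries of B and records `n_neighbors + 1` as the penalty when nothing matches, the `match` callable stands in for the signal comparison function (for example a thresholded DTW distance), and `kendall_tau` computes the simple tau-a variant without a p-value.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def rank_matches(
    profile_a: Sequence[T],
    profile_b: Sequence[T],
    match: Callable[[T, T], bool],
    n_neighbors: int,
) -> List[int]:
    """For each entry of A (by rank), return the 1-based rank of its first match in B."""
    ranks = []
    for signal_a in profile_a:
        rank = n_neighbors + 1  # penalty if no match is found
        for position, signal_b in enumerate(profile_b[:n_neighbors], start=1):
            if match(signal_a, signal_b):
                rank = position
                break
        ranks.append(rank)
    return ranks


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs; ties count as neither."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long vectors with at least two entries")
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Worked example mirroring Fig. 3: A's top three signals match B's ranks 6, 5 and 3.
profile_a = ["s1", "s2", "s3"]
profile_b = ["x", "y", "s3", "z", "s2", "s1"]
b_ranks = rank_matches(profile_a, profile_b, lambda p, q: p == q, n_neighbors=10)
rank_sum = sum(b_ranks)  # 6 + 5 + 3
tau = kendall_tau([1, 2, 3], b_ranks)
```

Here A's natural vector is [1, 2, 3] and the returned vector of B in relation to A is [6, 5, 3], giving the rank sum of 14 from the Fig. 3 example. Where a p-value is wanted, a statistics library routine such as `scipy.stats.kendalltau` can be used instead of the hand-written coefficient.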
Sequence data
[0036] The stored signal data can be obtained directly from a sequencing machine (e.g. Oxford Nanopore devices such as MinION, GridION, PromethION, etc.) or indirectly by taking sequence data in base space, such as generated by short read sequencing (Illumina), or from transcriptome annotations generated from de novo assembly of data, or cDNA sequencing using other technologies, and converting the nucleotide sequence into a similar ‘squiggle’ signal format with tools like DeepSimulator or Scrappie (Mozilla Public License Version 2.0).
Approaches to quantifying signal abundance / signal comparison
Dynamic time warping
[0037] Signal-to-signal alignments are mapped via dynamic time warping (DTW), which runs in O(N1·N2) time, where N1 and N2 are the lengths of the two sequences, noting that this example relates to long reads. DTW additionally accommodates the discrepancy between the sampling rate of the electrical current measurements and the speed of the molecule passing through the pore. A DTW distance between two signals below a bootstrapped threshold constitutes a match. If no matching signal is present in the SQUID DB, the signal is recorded in the SQUID DB with a corresponding integer count; otherwise, if a matching signal is found, its abundance count is incremented.
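The match-or-insert counting described above can be sketched as follows. This is a minimal, unoptimized illustration: `dtw_distance` is a textbook dynamic-programming DTW with an absolute-difference cost, `update_db` and the list-of-pairs database layout are hypothetical, and the threshold is passed in as a fixed number rather than being bootstrapped.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW distance with |x - y| as local cost; O(len(a) * len(b)) time."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]


def update_db(db: List[list], signal: Sequence[float], threshold: float) -> None:
    """Increment the count of the first stored signal within threshold, else add a new entry.

    db is a list of [stored_signal, count] pairs and is modified in place.
    """
    for entry in db:
        if dtw_distance(entry[0], signal) <= threshold:
            entry[1] += 1  # matching signal found: increment abundance
            return
    db.append([list(signal), 1])  # no match: record the signal with count 1
```

The warping step is what lets a signal that dwells longer on one level (for example `[1, 2, 2, 3]`) still match a stored `[1, 2, 3]` at zero cost, which is the sampling-rate discrepancy the paragraph refers to.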
Machine learning via signal processing and feature extraction
[0038] After obtaining the signal, it is possible to clean the signal through a 1D wavelet filter, recapitulate the signal, and apply signal processing techniques to the regenerated signal (fast Fourier transform, power spectral density, auto-correlation, etc.) to obtain a feature set of the signal. These features, when used with well-labelled, accurate training sets (which can be based on Illumina data transformed into signal space), can be used as input for a classifier / model in established ML techniques.
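A feature set of the kind described above can be sketched with the standard library alone. Several substitutions are made for brevity and should be read as assumptions: a moving average stands in for the 1D wavelet filter, a naive discrete Fourier transform stands in for an FFT, and all function names are hypothetical.

```python
import cmath
from typing import List, Sequence, Tuple


def moving_average(signal: Sequence[float], width: int = 3) -> List[float]:
    """Simple smoothing filter (a stand-in for wavelet de-noising)."""
    half = width // 2
    out = []
    for t in range(len(signal)):
        window = signal[max(0, t - half): t + half + 1]
        out.append(sum(window) / len(window))
    return out


def power_spectrum(signal: Sequence[float]) -> List[float]:
    """Power at each non-negative frequency bin, via a naive O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def autocorrelation(signal: Sequence[float], lag: int) -> float:
    """Autocorrelation at the given lag, normalized by the lag-0 value."""
    n = len(signal)
    mu = sum(signal) / n
    var = sum((x - mu) ** 2 for x in signal)
    if var == 0:
        return 0.0
    return sum((signal[t] - mu) * (signal[t + lag] - mu) for t in range(n - lag)) / var


def peaks(signal: Sequence[float]) -> List[Tuple[int, float]]:
    """(x, y) co-ordinates of strict local maxima."""
    return [
        (t, signal[t])
        for t in range(1, len(signal) - 1)
        if signal[t - 1] < signal[t] > signal[t + 1]
    ]
```

The outputs of these functions can be concatenated into one feature vector per signal and passed, together with tissue labels, to whichever classifier is trained.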
[0039] The model can be included with the SQUID DB for different samples / tissues, so that we can extract features from newly sequenced signals and classify them according to our trained models.
• Extract signal
• 1D wavelets -> de-noised signal, FFT, PSD, AC, (x,y) co-ordinates for peaks
• Build feature set
• Train model
• Assess model ability to differentiate signals
Application example
Mouse tissue
[0040] RNA was extracted from 3 mouse tissues (Brain, Kidney, Testes) and 4 samples were sequenced: one from each tissue and one unknown sample (blind control). Each sample was sequenced on 4 Oxford Nanopore MinION R9.4.1 flowcells using a cDNA + PCR library preparation protocol.
• Mouse brain read count: 983,348
• Mouse kidney read count: 875,066
• Mouse testes read count: 1,749,002
• Blind control read count: 706,115
[0041] Base called data (sequences) for the 3 known samples (samples B, K & T) were used to generate X-profiles as follows:
1. Each reference sequence of the mouse reference transcriptome (ftp://ftp.ensembl.org/pub/release-93/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa) is used as a database entry (e.g. the first column/qualitative feature of the X-profile examples above);
2. Sequences were aligned to database entries using the Minimap2 software;
3. The most similar database entry to a base called sequence as determined by Minimap2 has the associated counter incremented;
4. Repeat (2.) until all base called sequences have been aligned.
5. Sort the database entries decreasingly by their abundance.
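Steps 2 to 5 above can be sketched as follows, assuming the alignments are available as lines in PAF, the tab-separated format that Minimap2 writes by default (column 1 is the read name, column 6 the target name, column 10 the number of matching residues). The function name `xprofile_from_paf` is hypothetical, and choosing the target with the most matching residues as the ‘most similar’ entry is an assumption made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def xprofile_from_paf(paf_lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Count one best-matching transcript per read; return entries by decreasing count."""
    best: Dict[str, Tuple[int, str]] = {}  # read name -> (matching residues, target)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        read, target, matches = fields[0], fields[5], int(fields[9])
        if read not in best or matches > best[read][0]:
            best[read] = (matches, target)
    counts = Counter(target for _, target in best.values())
    return counts.most_common()  # sorted by decreasing abundance
```

Each read increments exactly one counter, so the counts sum to the number of aligned reads and reads without an alignment simply do not contribute.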
[0042] A fourth X-profile (sample X) was then generated using increasing numbers of reads. A first X-profile was generated as described above with the first 1000 base called reads from sample X (Xp-1k), then compared to samples B, K, & T using a rank sum correlation. The respective values are plotted in Figure 4.
[0043] A second X-profile (Xp-10k) was then generated by sampling a further 9000 base called reads from sample X (10,000 total abundance) and adding them to Xp-1k. Xp-10k was then compared to the 3 X-profiles from known samples as previously described, and plotted in Figure 4.
[0044] This was also performed for 50k and 100k total reads. The increasing size of the presented X-profiles represents a growing X-profile during the acquisition of streaming data, such as produced by real-time sequencing platforms. As demonstrated in Fig. 4, Sample X can rapidly be classified as Sample B, or brain tissue, by comparing the relative similarity scores (here, the rank sum correlation) across reference X-profiles.
[0045] A final X-profile (Xp-F) including all base called sequences from sample X was compared to the 3 X-profiles from known samples, generating a match to sample B (brain) with a P-value of 0.02 (Tau test, τ ~ 0.1). This result was found to be discriminatory, as matches to the X-profiles of the other tissues did not result in a significantly correlated ranking (τ ~ 0.1, P-values > 0.65).
[0046] Other application examples
[0047] Sepsis stratification
(1) Sequence patients with and without sepsis to generate X-profiles, labelled for clinical data such as severity of infection, nature of pathogen, source of infection, patient age, health outcomes, demographics, date;
(2) Classify X-profiles into reference categories based on discriminatory features of interest, such as acute sepsis versus non-sepsis profiles;
(3) Sequence blood of a patient with unknown status to generate an X-profile in real time;
(4) Compare the X-profile generated in real time to reference X-profiles to determine the most similar category;
(5) Use comparative X-profile scores to make a clinical diagnosis, stratification of patient risk, or treatment recommendation.
[0048] Another example: (Cancer of unknown primary)
(1) Generate X-profiles for normal human tissues and tumours;
(2) Sequence a biopsy of a carcinoma of unknown primary (a metastatic tumour with an unknown tissue of origin);
(3) Compare X-profile from biopsy to X-profiles from normal tissues to identify tissue of origin and help guide subsequent treatment.
(4) Alternatively, any other tumour can be compared to previously sequenced tumours to find a match.
[0049] Another example: (Sample identification/validation)
(1) Generate X-profiles of various cell lines to validate or identify cell lines or contamination of cell lines.
[0050] General case:
(1) Can be used to test a query transcription profile against a set of reference profiles to identify some transcription level difference/similarity.
(2) Can be used to assess a change in transcription, for example, by the host response to a disease, a pathogen, or a treatment.
Identifying tissue of origin for cancers of unknown primary
[0051] By efficient/accurate iterative clustering of nanopore read raw signal binned on similarity via (dtw/CNN/hashing/metric), we performed long read quantification of RNA sequencing full length transcripts, resulting in expression profiles (datastore of transcript signal and abundance) for 4 samples.
[0052] Further validation can be performed via construction of synthetic expression profiles from publicly available Illumina data, where nucleotide sequences are converted into synthetic nanopore signals and (kallisto/de novo whole transcriptome assembly) used to quantify transcript abundance. These matched tissue/sample differences were then compared to similar tissues/samples sequenced with ONT, and concordance was found between the meta expression profile analyses.
Real-time sequencing
[0053] There are clinical applications where a diagnosis should be available within a relatively short time window. For example, a patient presenting with sepsis may arrive at an emergency department of a hospital and a treatment needs to be commenced before a time-consuming sequencing process can be performed. Even the download of a full data file of the sequencing result may take too long for this situation.
[0054] With the method proposed herein, the profiles are formatted such that they are compatible with a real-time processing of the sequencing data stream. That is, the sequencing signal is received and while the sequencing signal is being received (before the full data is available), a diagnosis can be made by the proposed method. In this sense, the indication of abundances in the profiles is continuously updated and after every update or periodically (such as every minute or every 5 minutes) the profile is matched against the stored profiles. In particular, one of the stored profiles may be the typical profile of a sepsis patient and a good match indicates sepsis as a diagnosis and treatment can be commenced straight away and within a short time window, such as within 10 minutes or within 30 minutes. This also means that the receiving of the sequencing data can be stopped before the full data has been received and as soon as a diagnosis has been provided.
[0055] In this sense, the data stream is processed in real time, while the stream is being generated. For example, a whole genome sequencing such as Illumina sequencing may be performed off-site but the dataset is too large to transmit via a relatively slow internet connection. For example, it may take three days to transmit the entire dataset which is too long for some diagnoses, such as sepsis.
[0056] It is an advantage that the sequences are ordered by abundance and the matching score represents the difference in the position of the sequence within the ordered sequences, because the most abundant sequences are likely to be sequenced at larger numbers early and therefore provide a robust diagnosis. In other words, the diagnosis is performed based on the most abundant (i.e. most accurate) sequences. In one example, the comparison between profiles is not performed on all available sequences but only on the top most abundant sequences (such as top 10 or top 100 sequences).
[0057] In one example, there is a threshold on the matching score and the analysis (i.e. receiving of further sequences) is stopped as soon as the threshold is met. For example, where a higher matching score indicates a worse match, the analysis is stopped as soon as the matching score is below the threshold (such as 100 in the example of Fig. 3).
[0058] While the sequences may comprise base calls, it is also possible that they comprise a time domain electrical signal, also referred to as squiggle, which may be indicative of the current through a nanopore while the bases pass through the nanopore. The advantage of using squiggles is that it is not necessary to call bases from the squiggle (i.e. convert the squiggle into sequence), which speeds up the process and increases reliability as approximations are removed. It is possible to use BLAST or minimap2 for sequence matching instead of DTW for squiggle matching.
[0059] It is noted that the method described herein is performed by a computer system comprising an input port to receive the sequences (such as USB) and a processor to create/update the expression profiles and to compare the expression profile against the database. The database may be local or remote and the comparison (i.e. calculating a matching score) may be performed remotely, such as in a cloud computing environment. It is noted that the bandwidth required for the cloud computing implementation is minimal because it is not necessary to upload the entire sequencing data set at once but only as it is generated by the sequencer. In that case, the library of expression profiles would also be stored in the cloud and matched there. This allows the use of relatively large libraries without the need for local data storage and without the need for full transfer of the entire sequencing data set as an upload from the sequencer. This has the significant technical advantage that the analysis of the sequencing data can be performed much faster because it is not necessary to wait for the upload to finish.
[0060] Fig. 5 illustrates a method 500 for diagnosis of sepsis in a sample from a patient using streaming data from a sequencer. The method comprises receiving 501 multiple sequences of the sample from the sequencer and generating 502 an expression profile for the sample. The expression profile comprises for each of the multiple sequences an indication of abundance of that sequence in the sample.
[0061] Method 500 also comprises receiving 503 further sequences as streaming data from the sequencer and while receiving 504 the further sequences, the method 500 comprises performing the steps of:
• updating 505 the expression profile for the sample;
• performing 506 a comparison of the expression profile for the sample to a stored expression profile indicative of an abundance of sequences when sepsis is present;
• determining 507 whether the patient has sepsis based on the comparison; and
• upon determining whether the patient has sepsis terminating 508 the receiving of the further sequences.
[0062] Fig. 6 illustrates a method 600 for determining a state of a biological sample using streaming data from a sequencer. Method 600 comprises receiving 601 multiple sequences of the sample from the sequencer and generating 602 an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample.
[0063] Method 600 further comprises receiving further sequences as streaming data from the sequencer and while receiving 604 the further sequences performing the steps of:
• updating 605 the expression profile for the sample;
• performing 606 a comparison of the expression profile for the sample to one or more stored expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample;
• determining 607 the state of the sample as the state associated with the matching stored expression profile; and
• upon determining the state of the sample terminating 608 the receiving of the further sequences.
[0064] It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the above-described embodiments, without departing from the broad general scope of the present disclosure. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

Claims

CLAIMS:
1. A method for diagnosis of sepsis in a sample from a patient using streaming data from a sequencer, the method comprising:
receiving multiple sequences of the sample from the sequencer;
generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;
receiving further sequences as streaming data from the sequencer;
while receiving the further sequences performing the steps of:
updating the expression profile for the sample;
performing a comparison of the expression profile for the sample to a stored expression profile indicative of an abundance of sequences when sepsis is present;
determining whether the patient has sepsis based on the comparison; and
upon determining whether the patient has sepsis terminating the receiving of the further sequences.
2. A method for determining a state of a biological sample using streaming data from a sequencer, the method comprising:
receiving multiple sequences of the sample from the sequencer;
generating an expression profile for the sample, the expression profile comprising for each of the multiple sequences an indication of abundance of that sequence in the sample;
receiving further sequences as streaming data from the sequencer;
while receiving the further sequences performing the steps of:
updating the expression profile for the sample;
performing a comparison of the expression profile for the sample to one or more stored expression profiles to determine a matching stored expression profile, each of the one or more stored expression profiles being associated with a respective predefined state of the sample;
determining the state of the sample as the state associated with the matching stored expression profile; and
upon determining the state of the sample terminating the receiving of the further sequences.
3. The method of claim 2, wherein the state of the sample comprises a tissue of origin as determined as the tissue of origin associated with the matching stored expression profile.
4. The method of claim 1, 2 or 3, wherein the sequencing data comprises a stream of consecutive information including the read sequences.
5. The method of any one of the preceding claims, wherein the sequencer comprises a nanopore continuously generating the sequencing data.
6. The method of any one of the preceding claims, wherein the expression profiles comprise a representation of an electric signal in a time-domain that corresponds to a read direction along the sequence.
7. The method of any one of the preceding claims, wherein the comparison is based on comparing the sequences in the expression profiles.
8. The method of claim 7, wherein comparing the sequences is based on comparing features extracted from the sequences.
9. The method of any one of the preceding claims, wherein the expression profiles comprise a list of sequences that is ordered by the respective abundances.
10. The method of any one of the preceding claims, wherein performing the comparison comprises calculating a matching score between the expression profile for the sample and the one or more stored expression profiles.
11. The method of claim 10, wherein the matching score is based on an order of sequences in the expression profiles by respective abundances.
12. The method of claim 11, wherein the matching score is based on a difference in a position of a sequence within the ordered sequences between the expression profiles.
13. The method of claim 10 or 11, wherein the matching score is based on a rank correlation coefficient.
14. The method of any one of the preceding claims, wherein the state of the sample is determined and the receiving of the further sequences is terminated when a matching score determined by the comparison meets a pre-defined threshold.
PCT/AU2019/051049 2018-09-27 2019-09-27 Expression profiling WO2020061643A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
AU2018903657A AU2018903657A0 (en) 2018-09-27 Expression profiling
AU2018903657 2018-09-27

Publications (1)

Publication Number Publication Date
WO2020061643A1 true WO2020061643A1 (en) 2020-04-02

Family

ID=69949185

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/AU2019/051049 WO2020061643A1 (en) 2018-09-27 2019-09-27 Expression profiling

Country Status (1)

Country Link
WO (1) WO2020061643A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011106536A2 (en) * 2010-02-24 2011-09-01 The Broad Institute, Inc Methods of diagnosing infectious disease pathogens and their drug sensitivity
US9322820B2 (en) * 2013-03-14 2016-04-26 Wisconsin Alumni Research Foundation System and apparatus for nanopore sequencing
WO2017106918A1 (en) * 2015-12-24 2017-06-29 Immunexpress Pty Ltd Triage biomarkers and uses therefor

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19867022

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19867022

Country of ref document: EP

Kind code of ref document: A1