CN115667554A - Method and system for detecting colorectal cancer by nucleic acid methylation analysis - Google Patents

Publication number
CN115667554A
Authority
CN
China
Prior art keywords
methylation
methylated
genomic regions
tables
colorectal cancer
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180039398.8A
Other languages
Chinese (zh)
Inventor
约翰·圣约翰
史蒂文·科腾-希尔
杨睿
A·德拉克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Frinum Holdings
Original Assignee
Frinum Holdings

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • C12Q1/6806 Preparing nucleic acids for analysis, e.g. for polymerase chain reaction [PCR] assay
    • C12Q1/6869 Methods for sequencing
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Abstract

The present disclosure provides methods and systems for screening for, or detecting the progression of, colorectal cancer or a subsequent colorectal disease, which are applicable to cell-free nucleic acids such as cell-free DNA. The methods can train a machine learning model using methylation signals detected within individual sequencing reads in identified genomic regions as input features, and generate classifiers suitable for stratifying a population of individuals. The methods may include extracting DNA from a cell-free sample obtained from a subject, converting the DNA for methylation sequencing, generating sequencing reads, detecting a signal associated with a colon cell proliferative disorder in the sequencing information, and training a machine learning model to provide a discriminator capable of distinguishing groups such as healthy and cancer, or distinguishing disease subtypes or stages, in a population of subjects. The methods can be used, for example, to predict, prognosticate, and/or monitor response to treatment, tumor burden, recurrence, or progression of colorectal cancer.

Description

Method and system for detecting colorectal cancer by nucleic acid methylation analysis
Cross Reference to Related Applications
The present application claims the benefit of U.S. provisional patent application No. 63/002,878, filed March 31, 2020, the contents of which are hereby incorporated by reference in their entirety.
Background
The present disclosure relates generally to cancer detection and disease monitoring. More specifically, the field relates to cancer-associated DNA methylation detection and disease monitoring of early colorectal cancer (CRC). Cancer screening and monitoring have helped improve outcomes over the past decades, since early detection leads to better outcomes and cancer can be eliminated before it spreads. For example, in the case of CRC, colonoscopy may play a role in improving early diagnosis. Unfortunately, challenges may arise because patient compliance with screening does not reach recommended levels.
The main problem with any screening tool can be the compromise between false positive and false negative results (or specificity and sensitivity), leading to unnecessary investigation in the former case and ineffectiveness in the latter case. The ideal test may be one with a high Positive Predictive Value (PPV), minimizing unnecessary investigation, but able to detect the vast majority of cancers. Another key factor may be the so-called "detection sensitivity", that is, the lower limit of tumor size that can be detected, which is distinct from the test sensitivity. Unfortunately, waiting for the tumor to grow large enough to release circulating tumor markers at the levels necessary for detection may be incompatible with the need for early detection in order to treat the tumor at the stage where treatment is most effective. Thus, there is a need for effective blood-based screening for early CRC based on circulating analytes.
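The trade-off described above can be illustrated with a minimal sketch. The cohort counts and the 1% prevalence below are hypothetical values chosen only for illustration; they are not data from this disclosure. The sketch shows why a test with seemingly good sensitivity and specificity can still have a low PPV when the disease is rare in the screened population.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true cancers that are detected
        "specificity": tn / (tn + fp),  # fraction of healthy subjects correctly cleared
        "ppv": tp / (tp + fp),          # fraction of positive calls that are true cancers
        "npv": tn / (tn + fn),          # fraction of negative calls that are truly healthy
    }


def ppv_from_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV for a given disease prevalence, obtained from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical cohort: 1000 screened subjects, 10 of whom have cancer.
metrics = screening_metrics(tp=9, fp=99, tn=891, fn=1)
print(metrics)

# Same test characteristics expressed as rates, at an assumed 1% prevalence.
print(ppv_from_prevalence(sensitivity=0.90, specificity=0.90, prevalence=0.01))
```

In this hypothetical cohort, sensitivity and specificity are both 90%, yet fewer than one in ten positive calls is a true cancer, which is the "unnecessary investigation" cost the paragraph refers to.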
The detection of circulating tumor DNA is increasingly considered to be a viable "liquid biopsy" allowing detection and informative investigation of tumors in a non-invasive manner. In some cases, these techniques have been applied to colon, breast and prostate cancers through the identification of tumor-specific mutations. The sensitivity of these techniques may be limited due to the presence of a high background of normal (e.g., non-tumor-derived) DNA in the circulation.
Detection of tumor specific methylation in blood can provide significant advantages over detection of mutations. In cancers including lung, colon, and breast cancers, a number of single or multiple methylation biomarkers can be assessed. These may have low sensitivity, as they may not be prevalent in tumors.
There remains a need for more sensitive and specific screening tools to detect colorectal cancer tumor signals in early stages or low tumor burden in recurrence, and to perform primary screening in high risk populations.
Disclosure of Invention
The present disclosure provides methods and systems relating to gene methylation profiling in association with colorectal cancer detection and disease progression.
In one aspect, the present disclosure provides a methylation signature panel specific for a colon cell proliferative disorder, comprising: one or more methylated genomic regions selected from table 11, wherein the one or more regions are more methylated in a biological sample from an individual having a colonic cell proliferative disorder or a subtype of a colonic cell proliferative disorder and are less methylated in normal tissue and normal blood cells of an individual not having a colonic cell proliferative disorder.
In some embodiments, the biological sample is nucleic acid, DNA, ribonucleic acid (RNA), or cell-free nucleic acid (e.g., cfDNA or cfRNA).
In some embodiments, the genomic region is divided into non-coding regions, or non-transcribed or regulatory regions.
In some embodiments, the signature panel comprises increased methylation in two or more genomic regions selected from table 11.
In some embodiments, the biological sample obtained from the subject is selected from the group consisting of: cell-free DNA, cell-free RNA, bodily fluids, stool, colonic discharge, urine, plasma, serum, whole blood, isolated blood cells, cells isolated from blood, and combinations thereof.
In some embodiments, the colonic cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
In some embodiments, the colon cell proliferative disorder is selected from stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, or stage 4 colorectal cancer.
In some embodiments, the signature panel comprises two or more, three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions in tables 1-11.
In some embodiments, the signature panel comprises genomic regions that are methylated in colorectal cancer, including methylated regions in one or more genomic regions selected from the group consisting of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO and ZNF543.
In some embodiments, the region that is methylated in colorectal cancer comprises a methylated region in the genomic regions of ITGA4 and EMBP1.
In some embodiments, the region that is methylated in colorectal cancer comprises a methylated region in one or more genomic regions selected from the group consisting of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B.
In some embodiments, the signature panel comprises methylated genomic regions selected from table 1, table 2, table 3, table 4, table 5, table 6, table 7, table 8, table 9, table 10, and table 11.
In another aspect, the present disclosure provides a methylation signature panel that is characteristic of a colon cell proliferative disorder, comprising: two or more methylated genomic regions of tables 1-11, wherein the two or more regions are more methylated in a biological sample from an individual having a colonic cell proliferative disorder or a subtype of a colonic cell proliferative disorder and are less methylated in normal tissue and normal blood cells of an individual not having a colonic cell proliferative disorder.
In some embodiments, the biological sample is nucleic acid, DNA, ribonucleic acid (RNA), or cell-free nucleic acid (cfDNA or cfRNA).
In some embodiments, the genomic region is divided into non-coding regions, or non-transcribed or regulatory regions.
In some embodiments, the signature panel comprises an increase in methylation in 6 or more or 12 or more genomic regions from tables 1-11.
In some embodiments, the biological sample obtained from the subject is selected from the group consisting of: cell-free DNA, cell-free RNA, bodily fluids, stool, colonic discharge, urine, plasma, serum, whole blood, isolated blood cells, cells isolated from blood, and combinations thereof.
In some embodiments, the colonic cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
In some embodiments, the colon cell proliferative disorder is selected from stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, or stage 4 colorectal cancer.
In some embodiments, the signature panel comprises three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions of tables 1-11.
In some embodiments, the signature panel comprises genomic regions that are methylated in colorectal cancer, including methylated regions in one or more genomic regions selected from the group consisting of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO and ZNF543.
In some embodiments, the region that is methylated in colorectal cancer comprises a methylated region in the genomic regions of ITGA4 and EMBP1.
In some embodiments, the region that is methylated in colorectal cancer comprises a methylated region in one or more genomic regions selected from the group consisting of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B.
In some embodiments, the signature panel comprises a methylated region selected from table 1, table 2, table 3, table 4, table 5, table 6, table 7, table 8, table 9, table 10, and table 11.
In another aspect, the present disclosure provides a classifier (e.g., a machine learning classifier) capable of distinguishing a population of healthy individuals from a population of individuals having a colonic cell proliferative disorder, comprising: a) a set of measurements representative of differentially methylated genomic regions, wherein the measurements are obtained from methylation sequencing data from a healthy subject and a subject having a colon cell proliferative disorder; b) wherein the measurements are used to generate a set of features corresponding to characteristics of the differentially methylated genomic regions, and the features are input to a machine learning or statistical model; and c) wherein the model provides feature vectors that serve as a classifier capable of distinguishing a population of healthy individuals from individuals suffering from a colonic cell proliferative disorder.
In some embodiments, the set of measurements describes a characteristic of a methylated region selected from the group consisting of: base-by-base percent methylation of CpG, CHG, and CHH; counts or ratios of fragments with different counts or ratios of methylated CpGs observed in a region; conversion efficiency (100 minus the average percent methylation of CHH); hypomethylated segments; methylation level (global average methylation of CpG, CHH, and CHG; fragment length; fragment midpoint; and methylation level in one or more genomic regions such as chrM, LINE1, or ALU); number of methylated CpGs per fragment; fraction of CpG methylation per fragment over total CpG per region; fraction of CpG methylation per region over total CpG per region; fraction of CpG methylation per panel over total CpG; dinucleotide coverage (normalized dinucleotide coverage); coverage uniformity (unique sites at 1x and 10x average genomic coverage (CGI run for S4)); global average CpG coverage (depth); and average coverage at CpG islands, CGI shelves, and CGI shores.
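A few of the fragment-level measurements listed above can be sketched as follows. This is an illustrative sketch only: the encoding of each fragment as a string with one character per CpG ('M' for methylated, 'U' for unmethylated) and the function name `fragment_features` are assumptions made for this example, not a format defined in the disclosure.

```python
from collections import Counter


def fragment_features(fragments: list[str]) -> dict:
    """Summarize CpG methylation for the fragments overlapping one region.

    Each fragment is a string with one character per CpG it covers:
    'M' = methylated, 'U' = unmethylated (hypothetical encoding).
    """
    fragments = [f for f in fragments if f]  # ignore fragments covering no CpG
    methylated_per_fragment = [f.count("M") for f in fragments]
    total_cpg = sum(len(f) for f in fragments)
    return {
        # number of methylated CpGs in each fragment
        "methylated_cpgs_per_fragment": methylated_per_fragment,
        # methylated fraction within each fragment
        "fraction_per_fragment": [
            count / len(f) for count, f in zip(methylated_per_fragment, fragments)
        ],
        # methylated CpG observations over all CpG observations in the region
        "region_fraction": sum(methylated_per_fragment) / total_cpg if total_cpg else 0.0,
        # how many fragments carry 0, 1, 2, ... methylated CpGs
        "fragments_by_methylated_count": dict(Counter(methylated_per_fragment)),
    }


features = fragment_features(["MMMM", "UUUU", "MMUU"])
print(features)
```

Keeping the per-fragment counts, rather than only the region average, preserves the distinction between a region where a few fragments are fully methylated and one where every fragment is partially methylated, even though both can give the same regional fraction.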
In some embodiments, the machine-learned model comprises a classifier loaded into a memory of a computer system, the machine-learned model trained using training vectors obtained from training biological samples, a first subset of the training biological samples identified as having a colonic cell proliferative disorder, and a second subset of the training biological samples identified as not having a colonic cell proliferative disorder.
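Training on two labeled subsets of samples, as described above, might look like the following sketch. Logistic regression fitted by plain gradient descent is used here only as a simple stand-in; the disclosure does not specify this model, and the feature values (fraction of methylated fragments in two regions), learning rate, and epoch count are all hypothetical.

```python
import math


def predict_proba(weights: list[float], bias: float, x: list[float]) -> float:
    """Probability that a feature vector belongs to the disorder class."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X: list[list[float]], y: list[int], lr: float = 0.5, epochs: int = 2000):
    """Fit logistic regression with batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi  # gradient of log loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


# Hypothetical training vectors: methylated-fragment fraction in two regions.
X_train = [[0.8, 0.7], [0.9, 0.6], [0.7, 0.9],     # first subset: disorder (label 1)
           [0.1, 0.2], [0.2, 0.1], [0.05, 0.15]]   # second subset: no disorder (label 0)
y_train = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(X_train, y_train)
print(predict_proba(weights, bias, [0.85, 0.8]))  # new sample resembling the disorder subset
print(predict_proba(weights, bias, [0.1, 0.1]))   # new sample resembling the healthy subset
```

In practice a held-out set or cross-validation would be needed to estimate performance; the toy data here are separable by construction.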
In some embodiments, the classifier is provided in a system for detecting a colon cell proliferative disorder, the system comprising: a) A computer readable medium comprising a classifier operable to classify a subject as having or not having a colon cell proliferative disorder according to a methylation signature panel; and b) one or more processors configured to execute instructions stored on the computer-readable medium.
In some embodiments, the system includes a classification circuit configured as a machine learning classifier selected from the group consisting of: deep learning classifiers, neural network classifiers, linear discriminant analysis (LDA) classifiers, quadratic discriminant analysis (QDA) classifiers, support vector machine (SVM) classifiers, random forest (RF) classifiers, linear kernel support vector machine classifiers, first-order or second-order polynomial kernel support vector machine classifiers, ridge regression classifiers, elastic net algorithm classifiers, sequential minimal optimization algorithm classifiers, naive Bayes algorithm classifiers, and principal component analysis classifiers.
In some embodiments, the computer-readable medium is a non-transitory computer-readable medium comprising machine-executable code that, when executed by one or more computer processors, implements any of the methods described above or elsewhere herein.
In some embodiments, the system includes one or more computer processors and computer memory coupled thereto. The computer memory includes machine executable code that, when executed by one or more computer processors, implements any of the methods described herein.
In another aspect, the present disclosure provides a method for determining a methylation profile of a cell-free deoxyribonucleic acid (cfDNA) sample from an individual, comprising: a) providing conditions capable of converting unmethylated cytosines to uracil in nucleic acid molecules of a cfDNA sample to produce a plurality of converted nucleic acids; b) contacting the plurality of converted nucleic acids with a nucleic acid probe complementary to a pre-identified methylation signature panel selected from at least two differentially methylated regions of tables 1-11 to enrich for sequences corresponding to the signature panel; c) determining the nucleic acid sequence of the plurality of converted nucleic acid molecules; and d) aligning the nucleic acid sequences of the plurality of converted nucleic acid molecules with a reference nucleic acid sequence, thereby determining the methylation profile of the individual.
In some embodiments, the nucleic acid sequencing library is prepared prior to amplification. In some embodiments, the method further comprises amplifying the plurality of converted nucleic acids. In some embodiments, the amplification comprises polymerase chain reaction (PCR). In some embodiments, the method further comprises determining the nucleic acid sequence of the converted nucleic acid molecules at a depth of greater than 1000x, greater than 2000x, greater than 3000x, greater than 4000x, or greater than 5000x. In some embodiments, the reference nucleic acid sequence is at least a portion of a human reference genome. In some embodiments, the human reference genome is hg18.
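The logic of steps a) and d) can be illustrated in silico. The sketch below assumes a read from the original top strand that is already aligned to the reference with no insertions or deletions, and it models conversion followed by amplification as unmethylated C being read as T. The sequences, the function names, and the set of methylated positions are hypothetical; real pipelines use dedicated bisulfite-aware aligners and also handle the opposite strand.

```python
def bisulfite_convert(seq: str, methylated_positions: set[int]) -> str:
    """Simulate conversion: unmethylated C is read as T (C -> U -> T after amplification)."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_cpg_methylation(reference: str, read: str) -> dict[int, bool]:
    """Call methylation at each reference CpG covered by an aligned, converted read.

    Returns {position of the C in the CpG: True if methylated, False if not}.
    """
    calls = {}
    for i in range(min(len(reference), len(read)) - 1):
        if reference[i:i + 2] == "CG":
            if read[i] == "C":      # C survived conversion, so it was methylated
                calls[i] = True
            elif read[i] == "T":    # C was converted, so it was unmethylated
                calls[i] = False
            # any other base is treated as a sequencing error or variant and skipped
    return calls


reference = "ACGTTCGACCGA"            # CpG sites start at positions 1, 5 and 9
read = bisulfite_convert(reference, methylated_positions={1, 9})
print(read)
print(call_cpg_methylation(reference, read))
```

Note that the non-CpG cytosine at position 8 is also converted to T; counting such non-CpG conversions is one common way to estimate conversion efficiency.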
In some embodiments, the methylation profile is associated with a colon cell proliferative disorder, and provides a classification of a subject as having a colon cell proliferative disorder.
In some embodiments, a nucleic acid adapter comprising a unique molecular identifier is ligated to unconverted nucleic acids in the cfDNA sample prior to a).
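Unique molecular identifiers (UMIs) are commonly used to collapse amplification duplicates so that each original molecule is counted once. The disclosure does not describe how the UMIs are used downstream, so the following is only a sketch of that common practice; the read tuple layout and the rule of keeping the first read per group are assumptions.

```python
def deduplicate_by_umi(reads: list[tuple[str, int, str]]) -> list[tuple[str, int, str]]:
    """Keep one read per (UMI, alignment start) pair.

    Each read is a (umi, start, sequence) tuple; reads sharing both the UMI and
    the start coordinate are assumed to be copies of the same original molecule.
    """
    seen = set()
    unique = []
    for umi, start, seq in reads:
        key = (umi, start)
        if key not in seen:
            seen.add(key)
            unique.append((umi, start, seq))
    return unique


reads = [
    ("AACGT", 100, "ACGTTTGA"),
    ("AACGT", 100, "ACGTTTGA"),  # amplification duplicate of the first read
    ("GGTCA", 100, "ACGTTCGA"),  # same position, different UMI: a distinct molecule
    ("AACGT", 250, "TTGACGTA"),  # same UMI, different position: a distinct molecule
]
deduplicated = deduplicate_by_umi(reads)
print(len(deduplicated))
```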
In some embodiments, the nucleic acid molecule is subjected to conditions for conversion of cytosine to uracil using chemical means, enzymatic means, or a combination thereof.
In some embodiments, cfDNA in a biological sample is treated with an agent selected from the group consisting of: bisulfite, hydrogen sulfite, disulfite, and combinations thereof.
In some embodiments, the biological sample obtained from the subject is selected from the group consisting of: cell-free DNA, cell-free RNA, bodily fluids, stool, colonic discharge, urine, plasma, serum, whole blood, isolated blood cells, cells isolated from blood, and combinations thereof.
In some embodiments, the method comprises comparing a methylation signature panel measured from a subject to a database of methylation signature panels measured from normal subjects, wherein the database is stored in a computer system; and determining an increased risk of the subject for developing a colonic cell proliferative disorder by measuring a change in methylation status of the methylation signature panel of at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, or 20% as compared to the methylation status from normal subjects.
In some embodiments, the pre-identified methylation signature panel comprises three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions in tables 1-11. In some embodiments, the pre-identified methylation signature panel comprises one or more methylated genomic regions in table 11, two or more methylated genomic regions in table 11, or three methylated genomic regions in table 11. In some embodiments, the methylation profile is indicative of the presence or absence of a colon cell proliferative disorder in the individual.
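The comparison against a database of normal subjects can be sketched as below. The text does not say whether the percentage change is absolute or relative, so this example assumes an absolute difference in percentage points from the mean of the normal subjects. The region names ITGA4 and EMBP1 come from the disclosure, while the methylation values, the threshold, and the function names are hypothetical.

```python
from statistics import mean


def methylation_change(
    subject: dict[str, float], normals: list[dict[str, float]]
) -> dict[str, float]:
    """Per-region difference (percentage points) between a subject and the normal mean."""
    return {
        region: value - mean(normal[region] for normal in normals)
        for region, value in subject.items()
    }


def flag_increased_risk(changes: dict[str, float], threshold_pct: float) -> bool:
    """Flag the subject if any region changes by at least the chosen threshold."""
    return any(abs(delta) >= threshold_pct for delta in changes.values())


# Hypothetical percent methylation per region.
normal_database = [
    {"ITGA4": 2.0, "EMBP1": 1.0},
    {"ITGA4": 4.0, "EMBP1": 3.0},
]
subject_panel = {"ITGA4": 12.0, "EMBP1": 2.5}

changes = methylation_change(subject_panel, normal_database)
print(changes)
print(flag_increased_risk(changes, threshold_pct=5.0))
```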
In some embodiments, the colonic cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
In some embodiments, the colon cell proliferative disorder is selected from the group consisting of stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, and stage 4 colorectal cancer.
In another aspect, the present disclosure provides a method for detecting the presence or absence of a colon cell proliferative disorder in a subject, comprising: a) providing conditions capable of converting unmethylated cytosines to uracil in nucleic acid molecules of a biological sample obtained or derived from a subject to produce a plurality of converted nucleic acids; b) contacting the plurality of converted nucleic acids with a nucleic acid probe complementary to a pre-identified methylation signature panel selected from at least two differentially methylated regions of tables 1-11 to enrich for sequences corresponding to the signature panel; c) determining the nucleic acid sequence of the plurality of converted nucleic acid molecules; d) aligning the nucleic acid sequences of the plurality of converted nucleic acid molecules with a reference nucleic acid sequence, thereby determining the methylation profile of the individual; and e) applying a trained machine learning model to the methylation profile, wherein the trained machine learning model is trained to distinguish between a healthy individual and an individual suffering from a colonic cell proliferative disorder, to provide an output related to the presence of the colonic cell proliferative disorder, thereby detecting the presence or absence of the colonic cell proliferative disorder in the subject.
In some embodiments, the nucleic acid sequencing library is prepared prior to amplification. In some embodiments, the method further comprises amplifying the plurality of converted nucleic acids. In some embodiments, the amplification comprises polymerase chain reaction (PCR). In some embodiments, the method further comprises determining the nucleic acid sequence of the converted nucleic acid molecules at a depth of greater than 1000x, greater than 2000x, greater than 3000x, greater than 4000x, or greater than 5000x. In some embodiments, the reference nucleic acid sequence is at least a portion of a human reference genome. In some embodiments, the human reference genome is hg18.
In some embodiments, the biological sample obtained from the subject is selected from the group consisting of: cell-free DNA, cell-free RNA, bodily fluids, stool, colonic discharge, urine, plasma, serum, whole blood, isolated blood cells, cells isolated from blood, and combinations thereof.
In some embodiments, the method comprises comparing a methylation signature panel measured from a subject to a database of methylation signature panels measured from normal subjects, wherein the database is stored in a computer system; and determining an increased risk of the subject for developing a colonic cell proliferative disorder by measuring a change in methylation status of the methylation signature panel of at least 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 11%, 12%, 13%, 14%, 15%, 16%, 17%, 18%, 19%, or 20% as compared to the methylation status from normal subjects.
In some embodiments, the pre-identified methylation signature panel comprises three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions in tables 1-11. In some embodiments, the pre-identified methylation signature panel comprises one or more methylated genomic regions in table 11, two or more methylated genomic regions in table 11, or three methylated genomic regions in table 11. In some embodiments, the methylation profile is indicative of the presence or absence of a colon cell proliferative disorder in the individual. In some embodiments, the method further comprises administering to the individual a treatment for the colon cell proliferative disorder based on detecting the presence of the colon cell proliferative disorder in the individual.
In some embodiments, the colonic cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
In some embodiments, the trained machine learning classifier is selected from: deep learning classifiers, neural network classifiers, linear discriminant analysis (LDA) classifiers, quadratic discriminant analysis (QDA) classifiers, support vector machine (SVM) classifiers, random forest (RF) classifiers, linear kernel support vector machine classifiers, first-order or second-order polynomial kernel support vector machine classifiers, ridge regression classifiers, elastic net algorithm classifiers, sequential minimal optimization algorithm classifiers, naive Bayes algorithm classifiers, and principal component analysis classifiers.
In some embodiments, the colon cell proliferative disorder is selected from the group consisting of stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, and stage 4 colorectal cancer.
In another aspect, the present disclosure provides a method for monitoring minimal residual disease in a subject previously treated for the disease, comprising: determining a methylation profile as described herein as a baseline methylation state, and repeating the analysis to determine the methylation profile at one or more predetermined time points, wherein a change from the baseline is indicative of a change in the minimal residual disease status of the subject relative to the baseline.
In some embodiments, the minimal residual disease is selected from the group consisting of response to treatment, tumor burden, post-operative residual tumor, recurrence, secondary screening, primary screening, and cancer progression.
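The baseline comparison described above can be sketched as follows. This is a minimal illustration only: the region names, the use of a mean difference in methylated fraction, and the 0.05 threshold are hypothetical choices made for the example and are not taken from the disclosure.

```python
from statistics import mean

def methylation_change(baseline, follow_up):
    """Mean change in methylated fraction over regions present in both profiles."""
    shared = sorted(set(baseline) & set(follow_up))
    if not shared:
        raise ValueError("profiles share no regions")
    return mean(follow_up[r] - baseline[r] for r in shared)

def flag_changes(baseline, time_points, threshold=0.05):
    """Return {label: (delta, changed)} for each follow-up profile.

    `threshold` is an illustrative cut-off, not a value from the disclosure.
    """
    result = {}
    for label, profile in time_points.items():
        delta = methylation_change(baseline, profile)
        result[label] = (delta, abs(delta) >= threshold)
    return result

# Hypothetical profiles: methylated fraction per region at each time point.
baseline = {"region_A": 0.02, "region_B": 0.04}
follow_ups = {
    "month_3": {"region_A": 0.02, "region_B": 0.05},
    "month_6": {"region_A": 0.20, "region_B": 0.30},
}
result = flag_changes(baseline, follow_ups)
```

In this toy data only the later time point departs from the baseline by more than the chosen threshold.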
In another aspect, a method for determining a response to a treatment is provided.
In another aspect, a method for monitoring tumor burden is provided.
In another aspect, a method for detecting a residual tumor after surgery is provided.
In another aspect, a method for detecting relapse is provided.
In another aspect, a method for use as a secondary screening is provided.
In another aspect, a method is provided for use as a primary screening.
In another aspect, a method for monitoring cancer progression is provided.
In some embodiments, the data set is indicative of the presence or susceptibility of colorectal cancer at a sensitivity of at least about 80%. In some embodiments, the data set is indicative of the presence or susceptibility of colorectal cancer at a sensitivity of at least about 90%. In some embodiments, the data set is indicative of the presence or susceptibility of colorectal cancer at a sensitivity of at least about 95%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Positive Predictive Value (PPV) of at least about 70%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Positive Predictive Value (PPV) of at least about 80%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Positive Predictive Value (PPV) of at least about 90%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Positive Predictive Value (PPV) of at least about 95%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Positive Predictive Value (PPV) of at least about 99%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Negative Predictive Value (NPV) of at least about 80%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Negative Predictive Value (NPV) of at least about 90%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Negative Predictive Value (NPV) of at least about 95%. In some embodiments, the data set indicates the presence or susceptibility of colorectal cancer at a Negative Predictive Value (NPV) of at least about 99%. 
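For reference, sensitivity, PPV, and NPV follow from the counts of true and false positives and negatives in the standard way. The sketch below uses hypothetical counts; they are not results from the disclosure.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard screening metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one sample in each
    relevant row or column of the confusion matrix).
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of cancers detected
        "specificity": tn / (tn + fp),  # fraction of non-cancers called negative
        "ppv": tp / (tp + fp),          # fraction of positive calls that are cancers
        "npv": tn / (tn + fn),          # fraction of negative calls that are non-cancers
    }

# Hypothetical cohort of 1000 subjects, 100 of whom have colorectal cancer.
metrics = screening_metrics(tp=90, fp=10, tn=890, fn=10)
```

Unlike sensitivity, PPV and NPV depend on how common the disease is in the tested population, so the same assay can show different predictive values in different cohorts.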
In some embodiments, the trained algorithm determines the presence or susceptibility of colorectal cancer in the subject by an area under the curve (AUC) of at least about 0.90. In some embodiments, the trained algorithm determines the presence or susceptibility of colorectal cancer in the subject by an area under the curve (AUC) of at least about 0.95. In some embodiments, the trained algorithm determines the presence or susceptibility of colorectal cancer in the subject by an area under the curve (AUC) of at least about 0.99.
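The AUC referred to above equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative sample, with ties counted as one half. A minimal sketch of that pairwise computation, using hypothetical scores and labels:

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for cancer, 0 for non-cancer. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three cancer and three non-cancer samples.
auc = roc_auc([0.9, 0.8, 0.35, 0.6, 0.2, 0.1], [1, 1, 1, 0, 0, 0])
```

Here eight of the nine positive/negative pairs are ranked correctly, so the AUC is 8/9.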
In some embodiments, the method further comprises displaying the report on a graphical user interface of the electronic device of the user. In some embodiments, the user is a subject, an individual, or a patient.
In some embodiments, the method further comprises determining the likelihood of the presence or susceptibility of colorectal cancer in the subject, individual or patient. For example, the likelihood may be a probability value between 0% and 100%.
In some embodiments, the trained algorithm (e.g., machine learning model or classifier) comprises a supervised machine learning algorithm. In some embodiments, the supervised machine learning algorithm comprises a deep learning algorithm, a support vector machine (SVM), a neural network, or a random forest.
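To illustrate how a supervised classifier can yield a likelihood between 0% and 100%, the sketch below trains a small logistic regression model by gradient descent. Logistic regression is used here only as a simple stand-in; it is not one of the algorithms named above, and the feature values, learning rate, and epoch count are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_probability(w, b, x):
    """Probability (0.0 to 1.0) that sample x belongs to the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical features: methylated fraction at two regions per sample.
X = [[0.05, 0.02], [0.10, 0.04], [0.60, 0.70], [0.80, 0.55]]
y = [0, 0, 1, 1]  # 0 = non-cancer, 1 = cancer
w, b = train_logistic(X, y)
p_high = predict_probability(w, b, [0.70, 0.60])
p_low = predict_probability(w, b, [0.05, 0.03])
```

Multiplying the returned value by 100 expresses it as a percentage likelihood.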
In some embodiments, the method further comprises providing the subject with a therapeutic intervention based at least in part on the methylation profile or analysis, such as a therapeutic intervention (e.g., chemotherapy, radiation therapy, immunotherapy, or surgery) to treat a colorectal cancer patient.
In some embodiments, the method further comprises monitoring for the presence or susceptibility to colorectal cancer, wherein the monitoring comprises assessing the presence or susceptibility to colorectal cancer in the subject at a plurality of time points, wherein the assessment is based at least on the determined presence or susceptibility to colorectal cancer at each of the plurality of time points.
In some embodiments, a difference in the assessment of the presence or susceptibility of colorectal cancer in the subject at a plurality of time points is indicative of one or more clinical indications selected from the group consisting of: (ii) a prognosis of the presence or susceptibility of colorectal cancer in the subject, and (iii) efficacy or ineffectiveness of a course of treatment to treat the presence or susceptibility of colorectal cancer in the subject.
In some embodiments, the method further comprises determining the colorectal cancer subtype of the subject from a plurality of different colorectal cancer subtypes or stages by stratifying the subject's colorectal cancer using a trained algorithm.
Another aspect of the disclosure provides a non-transitory computer-readable medium comprising machine-executable code that, when executed by one or more computer processors, performs any of the methods described above or elsewhere herein.
Another aspect of the disclosure provides a system that includes one or more computer processors and computer memory coupled thereto. The computer memory includes machine executable code that, when executed by one or more computer processors, performs any of the methods described above or elsewhere herein.
Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in the art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the disclosure is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Incorporation by Reference
All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
Drawings
Embodiments of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings. The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also referred to herein as "figures"), of which:
FIG. 1 provides a schematic diagram of a computer system programmed or otherwise configured with machine learning models and classifiers to implement the methods provided herein.
FIG. 2 provides the 4-fold cross-validated area under the curve (AUC) for the model trained on the regions in table 1.
FIGS. 3A-3F provide a series of receiver operating characteristic (ROC) curves, with area under the curve (AUC), for samples at different stages of CRC scored by the classification model. The ROC results show the ability of these differentially methylated regions (DMRs) to detect CRC and to discriminate early-stage cancers, including patients with stage 1 (FIG. 3A), stage 2 (FIG. 3B), stage 3 (FIG. 3C), stage 4 (FIG. 3D), missing stage information (FIG. 3E), and all samples (FIG. 3F).
Detailed Description
While various embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It will be appreciated that various alternatives to the embodiments of the invention described herein may be employed.
The present disclosure relates generally to cancer detection and disease monitoring. More specifically, the field relates to cancer-associated DNA methylation detection and disease monitoring of early colorectal cancer. Cancer screening and monitoring have helped improve outcomes over the past decades, because early detection leads to better outcomes and cancer can be eliminated before it spreads. In the case of colorectal cancer, for example, the use of colonoscopy may play a role in improving early diagnosis. Unfortunately, challenges may arise because patient compliance with screening falls short of recommended levels.
A main problem with any screening tool can be the trade-off between false positive and false negative results (or specificity and sensitivity), which lead to unnecessary investigation in the former case and an ineffective test in the latter case. An ideal test may be one with a high positive predictive value (PPV), minimizing unnecessary investigation, while still detecting the vast majority of cancers. Another key factor may be the so-called "detection sensitivity", that is, the lower limit of detection in terms of tumor size, as distinct from test sensitivity. Unfortunately, waiting for a tumor to grow large enough to release circulating tumor markers at the levels necessary for detection may be incompatible with the requirement for early detection, in which the tumor is treated at the stage where treatment is most effective. Thus, there is a need for an effective blood-based screen for early-stage colorectal cancer based on circulating analytes.
The detection of circulating tumor DNA is increasingly considered to be a viable "liquid biopsy" allowing detection and informative investigation of tumors in a non-invasive manner. In some cases, these techniques have been applied to colon, breast and prostate cancers through the identification of tumor-specific mutations. The sensitivity of these techniques may be limited due to the presence of high background normal (e.g., non-tumor-derived) DNA in the circulation.
Detection of tumor specific methylation in blood can provide significant advantages over detection of mutations. In cancers including lung, colon, and breast cancers, a number of single or multiple methylation biomarkers can be assessed. These may have low sensitivity because they may not be prevalent in tumors.
There remains a need for more sensitive and specific screening tools to detect colorectal cancer tumor signals in early stages or low tumor burden in recurrence, and to perform primary screening in high risk populations.
The present disclosure provides methods and systems relating to gene methylation profiling in association with colorectal cancer detection and disease progression.
In one aspect, the disclosure provides methods of using a panel of methylated regions suitable for analysis of methylation within a region or gene. Other aspects provide novel uses of the regions, genes, and gene products, as well as methods, assays, and kits relating to detecting, differentiating, and distinguishing colon cell proliferative disorders. The methods and nucleic acids provided herein can be used to analyze a colon cell proliferative disorder selected from the group consisting of adenocarcinoma, adenoma, polyp, squamous cell carcinoma, carcinoid tumor, sarcoma, and lymphoma.
In some embodiments, the methods comprise using one or more genes selected from the group consisting of the methylated regions as markers for the detection, differentiation, and distinguishing of a colon cell proliferative disorder. The use of one or more genes selected from the methylated regions described herein, and their promoter or regulatory elements, can be enabled by analyzing the methylation status of the genes.
The methods and systems of the present disclosure may include analyzing the methylation status of CpG dinucleotides in one or more genomic sequences based on the methylated regions described herein and the sequences complementary thereto.
I. Definitions
As used in the specification and in the claims, the singular form of "a", "an", and "the" include plural referents unless the context clearly dictates otherwise. For example, the term "nucleic acid" includes a plurality of nucleic acids, including mixtures thereof.
As used herein, the term "subject" generally refers to an entity or medium having testable or detectable genetic information. The subject may be a person, an individual, or a patient. The subject may be a vertebrate, such as a mammal. Non-limiting examples of mammals include humans, simians, farm animals, sport animals, rodents, and pets. The subject may be a human having cancer or suspected of having cancer. The subject may exhibit symptoms indicative of a health or physiological state or condition of the subject, such as cancer or other diseases, disorders, or conditions of the subject. Alternatively, the subject may be asymptomatic for such a health or physiological state or condition.
As used herein, the term "sample" generally refers to a biological sample obtained or derived from one or more subjects. The biological sample may be a cell-free biological sample or a substantially cell-free biological sample, or may be processed or fractionated to produce a cell-free biological sample. For example, a cell-free biological sample can include cell-free ribonucleic acid (cfRNA), cell-free deoxyribonucleic acid (cfDNA), cell-free fetal DNA (cffDNA), plasma, serum, urine, saliva, amniotic fluid, and derivatives thereof. A cell-free biological sample can be obtained or derived from a subject using an ethylenediaminetetraacetic acid (EDTA) collection tube, a cell-free RNA collection tube, or a cell-free DNA collection tube. Cell-free biological samples can be derived from whole blood samples by fractionation (e.g., by centrifugation to separate cellular components from cell-free components). The biological sample or derivative thereof may contain cells. For example, the biological sample may be a blood sample or a derivative thereof (e.g., blood collected in a collection tube or a drop of blood).
As used herein, the term "nucleic acid" generally refers to a polymeric form of nucleotides of any length, whether deoxyribonucleotides (dNTPs) or ribonucleotides (rNTPs), or analogs thereof. The nucleic acid may have any three-dimensional structure and may perform any known or unknown function. Non-limiting examples of nucleic acids include deoxyribonucleic acid (DNA), ribonucleic acid (RNA), coding or non-coding regions of a gene or gene fragment, loci (locus) defined from linkage analysis, exons, introns, messenger RNA (mRNA), transfer RNA, ribosomal RNA, short interfering RNA (siRNA), short hairpin RNA (shRNA), microRNA (miRNA), ribozymes, cDNA, recombinant nucleic acids, branched nucleic acids, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, and primers. The nucleic acid may comprise one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs. Modifications to the nucleotide structure, if present, may be imparted before or after nucleic acid assembly. The nucleotide sequence of the nucleic acid may be interrupted by non-nucleotide components. The nucleic acid may be further modified after polymerization, such as by conjugation or binding to a reporter agent.
As used herein, the term "target nucleic acid" generally refers to a nucleic acid molecule in an initial population of nucleic acid molecules whose presence, amount, and/or nucleotide sequence, or changes in one or more of these, are to be determined. The target nucleic acid can be any type of nucleic acid, including DNA, RNA, and the like. As used herein, "target ribonucleic acid (RNA)" generally refers to a target nucleic acid that is an RNA. As used herein, "target deoxyribonucleic acid (DNA)" generally refers to a target nucleic acid that is DNA.
As used herein, the terms "amplifying" and "amplification" generally refer to increasing the size or number of nucleic acid molecules. The nucleic acid molecule may be single-stranded or double-stranded. Amplification may include the generation of one or more copies of a nucleic acid molecule or "amplification product". Amplification can be performed, for example, by extension (e.g., primer extension) or ligation. Amplification may include performing a primer extension reaction to generate a strand complementary to a single-stranded nucleic acid molecule, and in some cases, one or more copies of the strand and/or single-stranded nucleic acid molecule. The term "DNA amplification" generally refers to the generation of one or more copies of a DNA molecule or "amplified DNA product". The term "reverse transcription amplification" generally refers to the production of deoxyribonucleic acid (DNA) from a ribonucleic acid (RNA) template by the action of a reverse transcriptase.
As used herein, the term "cell-free nucleic acid (cfNA)" generally refers to a nucleic acid in a biological sample that is not contained in a cell, such as cell-free RNA ("cfRNA") or cell-free DNA ("cfDNA"). cfDNA can circulate freely in body fluids, such as in the bloodstream.
As used herein, the term "cell-free sample" generally refers to a biological sample that is substantially devoid of intact cells. It may be derived from a biological sample which itself is substantially devoid of cells, or from a sample from which cells have been removed. Examples of cell-free samples include those derived from blood, such as serum or plasma; urine; or samples derived from other sources such as semen, sputum, feces, catheter exudate, lymph, or recovered lavage fluid.
As used herein, the term "circulating tumor DNA" generally refers to cfDNA derived from a tumor.
As used herein, the term "genomic region" generally refers to a region of nucleic acid identified by its location in the chromosome. In some examples, a genomic region is referred to by a gene name and encompasses both the coding and non-coding regions associated with that physical region of nucleic acid. As used herein, a gene comprises coding regions (exons), non-coding regions (introns), transcriptional control regions or other regulatory regions, and promoters. In another example, a genomic region may include an intron, an exon, or an intron/exon boundary within a named gene.
As used herein, the term "CpG island" generally refers to a contiguous region of genomic DNA that meets the following criteria: (1) the frequency of CpG dinucleotides, expressed as an "observed/expected ratio", is greater than about 0.6; and (2) the "GC content" is greater than about 0.5. CpG islands are typically (but not always) between about 0.2 and 3 kilobases (kb) in length and have a high frequency of CpG sites. CpG islands are found at or near the promoters of about 40% of mammalian genes. CpG islands are also found outside mammalian genes. In some examples, CpG islands are found in exons, introns, promoters, enhancers, repressors, and transcriptional regulatory elements. CpG islands may tend to appear upstream of so-called "housekeeping genes". The CpG dinucleotide content of CpG islands is said to be at least about 60% of that statistically expected. The appearance of CpG islands at the 5' end or upstream of a gene may reflect a role in transcriptional regulation, and methylation of CpG sites within a gene promoter may lead to silencing. Indeed, silencing of tumor suppressors by methylation is a hallmark of many human cancers.
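The two criteria above can be checked on a sequence as sketched below. The sketch assumes the commonly used definition of the observed/expected ratio, (number of CpG × sequence length) / (number of C × number of G); the definition above does not spell out the formula, so this is an assumption, and the length criterion is not checked.

```python
def cpg_island_stats(seq):
    """Return (gc_content, observed_expected_ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")       # CpG dinucleotides cannot overlap themselves
    gc_content = (c + g) / n
    expected = c * g / n        # CpG count expected from base composition alone
    obs_exp = cpg / expected if expected else 0.0
    return gc_content, obs_exp

def meets_island_criteria(seq, min_gc=0.5, min_obs_exp=0.6):
    """Apply the two thresholds given in the definition."""
    gc_content, obs_exp = cpg_island_stats(seq)
    return gc_content > min_gc and obs_exp > min_obs_exp

# Toy sequences for illustration only; real islands are hundreds of bases long.
gc_rich, ratio = cpg_island_stats("CGCGCGCGAT")
```

In practice such statistics are usually computed over a sliding window along the genome rather than on a single short sequence.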
As used herein, the term "CpG shore" generally refers to a short distance region extending outward from a CpG island where methylation may also occur. CpG shores can be found in the region of about 0 to 2 kb upstream and downstream of CpG islands.
As used herein, the term "CpG shelf" generally refers to a short distance region extending from a CpG shore where methylation may also occur. CpG shelves are typically found in regions between about 2 kb and 4 kb upstream and downstream of CpG islands (e.g., a further 2 kb extension outward from a CpG shore).
As used herein, the term "colon cell proliferative disorder" generally refers to a disorder or disease comprising disordered or abnormal proliferation of colon or rectal cells. In some examples, the disorder is selected from: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
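The distance ranges above can be turned into a simple position classifier. In this sketch the region 2 to 4 kb out is labeled "shelf" (the usual English term), and the label "open sea" for anything farther away is a common convention that does not appear in the definitions above. Coordinates are hypothetical and treated as inclusive.

```python
def classify_cpg_context(position, island_start, island_end,
                         shore_kb=2.0, shelf_kb=4.0):
    """Classify a genomic position relative to one CpG island on the same chromosome."""
    if island_start > island_end:
        raise ValueError("island_start must not exceed island_end")
    if island_start <= position <= island_end:
        return "island"
    # Distance in bases from the nearest island edge.
    if position < island_start:
        distance = island_start - position
    else:
        distance = position - island_end
    if distance <= shore_kb * 1000:
        return "shore"
    if distance <= shelf_kb * 1000:
        return "shelf"
    return "open sea"

# Hypothetical island spanning positions 10000-11000.
contexts = [classify_cpg_context(p, 10000, 11000)
            for p in (10500, 9500, 14000, 20000)]
```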
As used herein, the term "epigenetic parameter" generally refers to cytosine methylation. Further epigenetic parameters include, for example, acetylation of histones, which, although they may not be directly analyzable using the described methods, are inversely related to DNA methylation.
As used herein, the term "genetic parameter" generally refers to mutations and polymorphisms of a gene and sequences further required for its regulation. Examples of mutations include insertions, deletions, point mutations, inversions, and polymorphisms, such as SNPs (single nucleotide polymorphisms).
As used herein, the term "hemi-methylation" or "hemimethylation" generally refers to the methylation status of a palindromic CpG methylation site in which only a single cytosine in one of the two CpG dinucleotide sequences of the palindromic CpG methylation site is methylated (e.g., 5'-CC(M)GG-3' (top strand): 3'-GGCC-5' (bottom strand), where (M) marks the methylated cytosine).
As used herein, the term "hypermethylation" generally refers to an average methylation state corresponding to an increase in the presence of 5-mC at one or more CpG dinucleotides in a DNA sequence of a test DNA sample relative to the amount of 5-mC seen at the corresponding CpG dinucleotides in a normal control DNA sample. In some embodiments, the test DNA sample is from an individual having a colon cell proliferative disorder.
As used herein, the term "hypomethylation" generally refers to an average methylation state corresponding to a reduction in the presence of 5-mC at one or more CpG dinucleotides in a DNA sequence of a test DNA sample relative to the amount of 5-mC seen at the corresponding CpG dinucleotide in a normal control DNA sample. In some embodiments, the test DNA sample is from an individual having a colon cell proliferative disorder.
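The definitions of hypermethylation and hypomethylation both compare an average methylation level in a test sample with that in a normal control. A minimal sketch of that comparison follows; the per-CpG methylated fractions and the 0.1 minimum difference are hypothetical values chosen for illustration.

```python
from statistics import mean

def methylation_shift(test_fractions, control_fractions, min_delta=0.1):
    """Compare mean 5-mC fraction at matching CpGs in test vs. control DNA.

    Returns (call, delta), where delta = mean(test) - mean(control).
    `min_delta` is an illustrative cut-off, not a value from the disclosure.
    """
    delta = mean(test_fractions) - mean(control_fractions)
    if delta >= min_delta:
        return "hypermethylated", delta
    if delta <= -min_delta:
        return "hypomethylated", delta
    return "unchanged", delta

# Hypothetical methylated fractions at three CpG dinucleotides in one region.
call, delta = methylation_shift([0.8, 0.7, 0.9], [0.1, 0.2, 0.1])
```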
As used herein, the term "methylation state" or "methylation status" generally refers to the presence or absence of 5-methylcytosine ("5-mC") at one or more CpG dinucleotides in a DNA sequence. The methylation status of one or more specific CpG palindromic methylation sites (two CpG dinucleotide sequences per site) in a DNA sequence includes "unmethylated", "fully methylated" and "hemimethylated".
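The three states of a palindromic CpG site follow directly from whether the cytosine on each strand carries 5-mC. The function and argument names below are hypothetical and serve only to make the mapping explicit.

```python
def cpg_site_state(top_methylated: bool, bottom_methylated: bool) -> str:
    """Map the per-strand methylation of one palindromic CpG site to its state."""
    if top_methylated and bottom_methylated:
        return "fully methylated"
    if top_methylated or bottom_methylated:
        return "hemimethylated"   # exactly one of the two strands is methylated
    return "unmethylated"

states = [cpg_site_state(t, b)
          for t, b in ((False, False), (True, False), (False, True), (True, True))]
```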
As used herein, the term "methylated cytosine" generally refers to any methylated form of the nucleobase cytosine in which a methyl or hydroxymethyl functionality is contained at the 5 position. Methylated cytosines are known to be regulators of gene transcription in genomic DNA. This term may include 5-methylcytosine and 5-hydroxymethylcytosine.
The term "methylation assay" as used herein generally refers to any assay for determining the methylation status of one or more CpG dinucleotide sequences within a DNA sequence.
As used herein, the term "minimal residual disease" or "MRD" generally refers to a small number of cancer cells in the body following cancer treatment. MRD testing may be performed to determine whether cancer treatment is effective and to guide further treatment planning.
As used herein, the term "MSP" (methylation-specific polymerase chain reaction (PCR)) generally refers to methylation assays such as those described by Herman et al., Proc. Natl. Acad. Sci. USA 93:9821-9826, 1996 and U.S. Patent No. 5,786,146 (the contents of each of which are incorporated herein by reference).
As used herein, the term "methylation converted" or "converted" nucleic acid generally refers to a nucleic acid, such as, for example, DNA, that has undergone a DNA conversion process for methylation sequencing. Examples of conversion processes include reagent-based (such as bisulfite) conversion, enzymatic conversion, or combinatorial conversion (such as TET-assisted pyridine borane sequencing (TAPS) conversion), where unmethylated cytosines are converted to uracil prior to PCR amplification or sequencing. The conversion process can be used in a methylation sequencing method to distinguish between methylated and unmethylated cytosine bases.
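The effect of a bisulfite-type conversion on a read can be simulated as below: unmethylated cytosines become uracil, which is read as thymine after PCR, while methylated cytosines remain cytosine. This is a simplified single-strand sketch with hypothetical function names; it ignores incomplete conversion, sequencing errors, and the reverse strand.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate conversion: unmethylated C -> T (via U), methylated C stays C."""
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)

def call_methylation(reference, converted):
    """For each reference C, call True (methylated) if the read still shows C."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted.upper())):
        if ref_base == "C":
            if read_base == "C":
                calls[i] = True
            elif read_base == "T":
                calls[i] = False
    return calls

reference = "ACGTCGCA"                                      # hypothetical fragment
read = bisulfite_convert(reference, methylated_positions=[1])
calls = call_methylation(reference, read)
```

Comparing the converted read with the unconverted reference is what allows methylated and unmethylated cytosines to be told apart.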
As used herein, the term "region of methylation in cancer" generally refers to a segment of the genome that contains a methylation site (CpG dinucleotide), the methylation of which is associated with a malignant cellular state. Methylation of a region can be associated with more than one different type of cancer, or specifically with one type of cancer. In addition, methylation of a region can be associated with more than one cancer subtype, or specifically with one cancer subtype.
The terms "type" and "subtype" of cancer are generally used herein in a relative sense, whereby one "type" of cancer, such as breast cancer, may be divided into "subtypes" based on, for example, stage, morphology, histology, gene expression, receptor profile, mutation profile, aggressiveness, prognosis, malignancy characteristics, and the like. Likewise, "type" and "subtype" may be applied at a finer level, e.g., to divide one histological "type" into "subtypes" defined, for example, by mutation profile or gene expression. Cancer "stage" is also used to refer to classification of cancer types based on histological and pathological features associated with disease progression.
Analysis of samples
The cell-free biological sample may be obtained or derived from a human subject. The cell-free biological sample may be stored under different storage conditions, such as different temperatures (e.g., room temperature, refrigerated or frozen conditions, 25 ℃, 4 ℃, -18 ℃, -20 ℃, or -80 ℃) or different suspensions (e.g., an EDTA collection tube, a cell-free RNA collection tube, or a cell-free DNA collection tube) prior to processing.
The cell-free biological sample can be obtained from a subject having cancer, a subject suspected of having cancer, or a subject that does not have or is not suspected of having cancer.
The cell-free biological sample may be collected before and/or after treatment of the cancer subject. During a treatment or treatment regimen, a cell-free biological sample can be obtained from a subject. Multiple cell-free biological samples can be obtained from a subject to monitor the effect of treatment over time. A cell-free biological sample can be taken from a subject known or suspected to have cancer for whom a definitive positive or negative diagnosis is not available through clinical testing. The sample may be taken from a subject suspected of having cancer. Cell-free biological samples can be taken from subjects presenting with unexplained symptoms such as fatigue, nausea, weight loss, pain, weakness, or bleeding. The cell-free biological sample may be taken from a subject with explained symptoms. The cell-free biological sample may be taken from a subject at risk for developing cancer due to factors such as family history, age, hypertension or pre-hypertension, diabetes or pre-diabetes, overweight or obesity, environmental exposure, lifestyle risk factors (e.g., smoking, drinking, or drug abuse), or the presence of other risk factors.
The cell-free biological sample may comprise one or more analytes that may be analyzed, such as cell-free ribonucleic acid (cfRNA) molecules suitable for analysis to generate transcriptome data, cell-free deoxyribonucleic acid (cfDNA) molecules suitable for analysis to generate genomic data, or a mixture or combination thereof. One or more such analytes (e.g., cfRNA molecules and/or cfDNA molecules) can be isolated or extracted from one or more cell-free biological samples of a subject for downstream analysis using one or more suitable assays.
After obtaining a cell-free biological sample from a subject, the cell-free biological sample can be processed to generate a data set indicative of cancer in the subject. For example, the presence, absence, or quantity of nucleic acid molecules of the cell-free biological sample is assessed at a panel of cancer-associated genomic loci (e.g., a quantitative measure of RNA transcripts or DNA at the cancer-associated genomic loci). In some embodiments, processing of a cell-free biological sample obtained from a subject may comprise: (i) subjecting the cell-free biological sample to conditions sufficient to isolate, enrich, or extract a plurality of nucleic acid molecules; and (ii) analyzing the plurality of nucleic acid molecules to generate a data set.
In some embodiments, a plurality of nucleic acid molecules are extracted from a cell-free biological sample and sequenced to generate a plurality of sequencing reads. The nucleic acid molecules may comprise ribonucleic acid (RNA) or deoxyribonucleic acid (DNA). Nucleic acid molecules (e.g., RNA or DNA) can be extracted from cell-free biological samples by a variety of methods, such as a kit protocol from MP, a DNA cell-free biological mini kit, or a cell-free biological DNA isolation kit protocol from Norgen. The extraction method may extract all RNA or DNA molecules from the sample. Alternatively, the extraction method may selectively extract a portion of the RNA or DNA molecules from the sample. RNA molecules extracted from a sample can be converted into DNA molecules by reverse transcription (RT).
Sequencing can be performed by any suitable sequencing method, such as massively parallel sequencing (MPS), paired-end sequencing, high-throughput sequencing, next-generation sequencing (NGS), shotgun sequencing, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, pyrosequencing, sequencing by synthesis (SBS), sequencing by ligation, and sequencing by hybridization.
Sequencing may include nucleic acid amplification (e.g., of RNA or DNA molecules). In some embodiments, the nucleic acid amplification is a polymerase chain reaction (PCR). An appropriate number of rounds of PCR (e.g., PCR, qPCR, reverse transcriptase PCR, digital PCR, etc.) can be performed to sufficiently amplify an initial amount of nucleic acid (e.g., RNA or DNA) to a desired input amount for subsequent sequencing. In some cases, PCR can be used for bulk amplification of target nucleic acids. This may involve the use of adapter sequences that are first ligated to different molecules, followed by PCR amplification using universal primers. PCR can be performed using any of a number of commercial kits, e.g., kits supplied by Life and others. In other cases, only certain target nucleic acids within a population of nucleic acids are amplified. Specific primers (possibly in combination with adapter ligation) can be used to selectively amplify certain targets for downstream sequencing. PCR may include targeted amplification of one or more genomic loci, such as genomic loci associated with cancer. Sequencing may include the use of simultaneous reverse transcription (RT) and polymerase chain reaction (PCR), such as a OneStep RT-PCR kit protocol provided by Thermo Fisher or another supplier.
RNA or DNA molecules isolated or extracted from a cell-free biological sample may be labeled, for example, with an identifiable tag to allow multiplexing of multiple samples. Any number of RNA or DNA samples can be multiplexed. For example, the multiplexed reaction may comprise RNA or DNA from at least about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more than 100 initial cell-free biological samples. For example, a plurality of cell-free biological samples can be labeled with sample barcodes such that each DNA molecule can be traced back to the sample (and subject) from which the DNA molecule originated. Such tags can be attached to RNA or DNA molecules by ligation or by PCR amplification with primers.
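The tracing of barcoded molecules back to their sample of origin can be sketched as follows. This is a minimal illustration only: it assumes that each read begins with a fixed-length sample barcode, and the function name `demultiplex`, the barcode sequences, and the sample names are hypothetical.

```python
def demultiplex(reads, barcode_to_sample, barcode_len):
    """Assign reads to samples by a fixed-length leading barcode (assumed layout)."""
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode not recognized: keep the full read for inspection.
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return by_sample, unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_1", "TGCA": "sample_2"}
reads = ["ACGTGGGA", "TGCATTTC", "ACGTCCCA", "NNNNAAAA"]
by_sample, unassigned = demultiplex(reads, barcodes, barcode_len=4)
print(by_sample)
print(unassigned)
```

Exact barcode matching is used here for simplicity; a practical pipeline would typically tolerate a limited number of barcode mismatches.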
After sequencing the nucleic acid molecules, the sequence reads can be subjected to appropriate bioinformatic processing to generate data indicative of the presence, absence, or relative assessment of cancer. For example, the sequence reads can be aligned to one or more reference genomes (e.g., genomes of one or more species, such as the human genome, e.g., hg19). Aligned sequence reads can be quantified at one or more genomic loci to generate a data set indicative of cancer. For example, quantifying sequences corresponding to a plurality of genomic loci associated with cancer can generate a data set indicative of cancer.
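As an illustration of quantifying aligned reads at genomic loci, the following sketch counts the reads that overlap each locus. It assumes that reads have already been aligned and reduced to (chromosome, start, end) tuples, and it treats all intervals as half-open; the function name and the example reads are hypothetical, while the two locus coordinates are those listed for IKZF1 and KCNQ5 in Table 11.

```python
def count_reads_per_locus(reads, loci):
    """Count aligned reads that overlap each locus.

    reads: iterable of (chrom, start, end) tuples (half-open, assumed).
    loci:  dict mapping locus name -> (chrom, start, end), same convention.
    """
    counts = {name: 0 for name in loci}
    for chrom, start, end in reads:
        for name, (l_chrom, l_start, l_end) in loci.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == l_chrom and start < l_end and l_start < end:
                counts[name] += 1
    return counts


loci = {
    "IKZF1": ("chr7", 50303445, 50305526),
    "KCNQ5": ("chr6", 72620772, 72623556),
}
# Hypothetical aligned reads, for illustration only.
reads = [
    ("chr7", 50303400, 50303550),  # overlaps IKZF1
    ("chr7", 50305000, 50305150),  # overlaps IKZF1
    ("chr6", 72623556, 72623700),  # starts exactly at the locus end: no overlap
    ("chr1", 100, 250),            # unrelated chromosome
]
print(count_reads_per_locus(reads, loci))  # {'IKZF1': 2, 'KCNQ5': 0}
```

The nested loop is adequate for a small panel of loci; for genome-scale inputs an interval index would be a more suitable choice.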
The cell-free biological sample can be processed without any nucleic acid extraction. For example, a cancer in a subject can be identified or monitored by using probes configured to selectively enrich for nucleic acid (e.g., RNA or DNA) molecules corresponding to a plurality of cancer-associated genomic loci. The probe may be a nucleic acid primer. The probe may have sequence complementarity with a nucleic acid sequence from one or more of a plurality of cancer-associated genomic loci or genomic regions. The plurality of cancer-associated genomic loci or genomic regions can comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least about 25, at least about 30, at least about 35, at least about 40, at least about 45, at least about 50, at least about 55, at least about 60, at least about 65, at least about 70, at least about 75, at least about 80, at least about 85, at least about 90, at least about 95, at least about 100, or more different cancer-associated genomic loci or genomic regions. The plurality of cancer-associated genomic loci or genomic regions can comprise one or more members selected from the groups listed in tables 1-11 (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, about 25, about 30, about 35, about 40, about 45, about 50, about 55, about 60, about 65, about 70, about 75, about 80, or more). A cancer-associated genomic locus or genomic region may be associated with different stages or subtypes of cancer (e.g., colorectal cancer).
The probe can be a nucleic acid molecule (e.g., RNA or DNA) having sequence complementarity to a nucleic acid sequence (e.g., RNA or DNA) of one or more genomic loci (e.g., a cancer-associated genomic locus). These nucleic acid molecules may be primers or enrichment sequences. Analysis of a cell-free biological sample using a probe that is selective for one or more genomic loci (e.g., cancer-associated genomic loci) can include the use of array hybridization (e.g., microarray-based), polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., RNA sequencing or DNA sequencing). In some embodiments, DNA or RNA can be analyzed by one or more of: DNA/RNA isothermal amplification methods (e.g., loop-mediated isothermal amplification (LAMP), helicase-dependent amplification (HDA), rolling circle amplification (RCA), recombinase polymerase amplification (RPA)), immunoassays, electrochemical assays, surface-enhanced Raman spectroscopy (SERS), quantum dot (QD)-based assays, molecular inversion probes, droplet digital PCR (ddPCR), CRISPR/Cas-based detection (e.g., CRISPR-typing PCR (ctPCR), specific high-sensitivity enzymatic reporter unlocking (SHERLOCK), DNA endonuclease-targeted CRISPR trans reporter (DETECTR), and CRISPR-mediated analog multi-event recording apparatus (CAMERA)), and laser transmission spectroscopy (LTS).
The assay readout can be quantified at one or more genomic loci (e.g., cancer-associated genomic loci) to generate data indicative of cancer. For example, quantification of array hybridization or polymerase chain reaction (PCR) signals corresponding to a plurality of genomic loci (e.g., cancer-associated genomic loci) can generate data indicative of cancer. Assay readouts can include quantitative PCR (qPCR) values, digital PCR (dPCR) values, droplet digital PCR (ddPCR) values, fluorescence values, and the like, or normalized values thereof. The assay may be a home-use test configured to be performed in a home setting.
In some embodiments, multiple assays can be used to simultaneously process cell-free biological samples of a subject. For example, a first assay may be used to process a first cell-free biological sample obtained or derived from a subject to generate a first data set indicative of cancer; and a second assay, different from the first assay, may be used to process a second cell-free biological sample obtained or derived from the subject to generate a second data set indicative of cancer. Either or both of the first data set and the second data set may then be analyzed to assess the cancer of the subject. For example, a single diagnostic index or diagnostic score may be generated based on a combination of the first data set and the second data set. As another example, separate diagnostic indices or diagnostic scores may be generated from the first data set and the second data set.
Cell-free biological samples can be processed using methylation specific assays. For example, methylation-specific assays can be used to identify quantitative measures of methylation (e.g., indicative of the presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in a cell-free biological sample of a subject. Methylation specific assays can be configured to process cell-free biological samples, such as blood samples or urine samples (or derivatives thereof) of a subject. A quantitative measure of cancer-associated genomic locus methylation (e.g., indicative of the presence, absence, or relative quantity) in a cell-free biological sample can be indicative of one or more cancers. Methylation-specific assays can be used to generate a data set to indicate a quantitative measure of methylation (e.g., indicative of the presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in a cell-free biological sample of a subject.
For example, methylation-specific assays can include one or more of: methylation-aware sequencing (e.g., using bisulfite treatment), pyrosequencing, methylation-sensitive single-strand conformation analysis (MS-SSCA), high-resolution melting analysis (HRM), methylation-sensitive single-nucleotide primer extension (Ms-SNuPE), base-specific cleavage/MALDI-TOF, microarray-based methylation assays, methylation-specific PCR, targeted bisulfite sequencing, oxidative bisulfite sequencing, mass spectrometry-based bisulfite sequencing, or reduced representation bisulfite sequencing (RRBS).
III. Signature panels
The present disclosure provides methods and systems for analyzing biological samples to obtain measurable features from a combination of hypermethylated regions in DNA that are associated with the development of a colonic cell proliferative disorder, thereby identifying signature panels of such regions. Features from the signature panel may be processed using a trained algorithm (e.g., a machine learning model) to create a classifier configured for stratifying a population of individuals with respect to colonic cell proliferative disorders. The method is characterized by using one or more nucleic acids having a methylated region as described in the signature panel, which are contacted prior to sequencing with one or a series of reagents capable of distinguishing methylated from unmethylated CpG dinucleotides within the identified region.
Signature panels as described herein generally refer to a collection of genomic DNA-targeted regions identified in a cell-free nucleic acid sample and exhibiting increased cytosine base methylation in samples associated with a colonic cell proliferative disorder. The formation of signature panels allows for rapid and specific analysis of specific methylated regions associated with colonic cell proliferative disorders. Signature panels described and used in the methods herein can be used to improve the diagnosis, prognosis, treatment selection, and monitoring (e.g., treatment monitoring) of colonic cell proliferative disorders.
The signature panels and methods of the present disclosure can provide significant improvements over current methods in addressing the need for markers or signature panels used in the detection of early-stage colonic cell proliferative disorders from bodily fluid samples such as whole blood, plasma, or serum. Current methods for detecting and diagnosing colonic cell proliferative disorders include colonoscopy, sigmoidoscopy, and fecal occult blood testing for colon cancer. Compared with these methods, the methods provided herein can be much less invasive than colonoscopy and at least as sensitive as, or more sensitive than, sigmoidoscopy, the fecal immunochemical test (FIT), and the fecal occult blood test (FOBT). The methods provided herein can have significant advantages over currently used markers in terms of sensitivity and specificity because they combine gene panels with highly sensitive assay techniques.
In some embodiments, the region of methylation in the cancer comprises a CpG island. In some embodiments, the region of methylation in the cancer comprises a CpG shore. In some embodiments, the region of methylation in the cancer comprises a CpG shelf. In some embodiments, the region of methylation in the cancer comprises a CpG island and a CpG shore. In some embodiments, the regions that are methylated in cancer include CpG islands, CpG shores, and CpG shelves.
In some embodiments, the region of methylation in the cancer includes CpG islands and sequences of about 0 to 4 kilobases (kb) upstream and downstream. Regions of methylation in cancer may also include CpG islands and the following sequences: upstream and downstream about 0 to 3kb, upstream and downstream about 0 to 2kb, upstream and downstream about 0 to 1kb, upstream and downstream about 0 to 500 base pairs (bp), upstream and downstream about 0 to 400bp, upstream and downstream about 0 to 300bp, upstream and downstream about 0 to 200bp, or upstream and downstream about 0 to 100bp.
According to some examples, a number of design parameters may be considered in selecting hypermethylated regions in cancer. In certain examples, the methylated region is about 200 bp, about 300 bp, about 400 bp, or about 500 bp in length. Data for this selection process can be obtained from a variety of sources, such as The Cancer Genome Atlas (TCGA), which covers a wide variety of cancers profiled using, for example, the Infinium HumanMethylation450 BeadChip, or from other sources based on whole-genome bisulfite sequencing or other methods. In some embodiments, regions can be selected using "methylation values" (which can be derived from TCGA level 3 methylation data, which in turn is derived from a β value of about -0.5 to 0.5). In some embodiments, amplification is performed with a primer set designed to amplify at least one methylation site having a methylation value in normal tissue of less than about -0.3. This can be established in a plurality of normal tissue samples, such as about 4. The methylation value can be equal to or less than about -0.1, about -0.2, about -0.3, about -0.4, about -0.5, about -0.6, about -0.7, about -0.8, about -0.9, or about -1.0.
In some embodiments, the primer set is designed to amplify at least one methylation site for which the difference between the average methylation value in cancer tissue and in normal tissue is greater than a predefined threshold, such as about 0.3. In some embodiments, the difference may be greater than about 0.1, about 0.2, about 0.3, about 0.4, about 0.5, about 0.6, about 0.7, about 0.8, about 0.9, or about 1.0. In some examples, the proximity of other methylation sites that meet this requirement can also play a role in selecting regions. In some embodiments, the primer set comprises a primer pair that amplifies at least one methylation site having at least one other methylation site within about 200 bp that also has a methylation value in normal tissue of less than about -0.3 and a difference between the average methylation value in cancer tissue and normal tissue of greater than about 0.3.
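The site-selection criteria above (a methylation value in normal tissue of less than about -0.3, a cancer-versus-normal difference of greater than about 0.3, and a neighboring qualifying site within about 200 bp) can be sketched as follows. This is only an illustrative sketch; the function name, the input layout, and the example values are hypothetical and are not taken from the disclosure.

```python
def select_sites(sites, normal_max=-0.3, min_diff=0.3, window=200):
    """Return positions of sites that pass the thresholds and have a passing neighbor.

    sites: list of dicts with keys 'pos', 'normal', 'tumor', where 'normal' and
    'tumor' are average methylation values (assumed to be precomputed).
    """
    def qualifies(site):
        return (site["normal"] < normal_max
                and site["tumor"] - site["normal"] > min_diff)

    passing = sorted((s["pos"] for s in sites if qualifies(s)))
    selected = []
    for pos in passing:
        # Keep a site only if another qualifying site lies within the window.
        if any(other != pos and abs(other - pos) <= window for other in passing):
            selected.append(pos)
    return selected


# Hypothetical average methylation values, for illustration only.
sites = [
    {"pos": 100, "normal": -0.45, "tumor": 0.20},
    {"pos": 250, "normal": -0.40, "tumor": 0.10},
    {"pos": 900, "normal": -0.40, "tumor": 0.30},   # passes, but has no neighbor
    {"pos": 1000, "normal": -0.10, "tumor": 0.40},  # normal value too high
]
print(select_sites(sites))  # [100, 250]
```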
In some examples, a target region is selected if its methylation is greater than that of the same region in a sample obtained or derived from one or more healthy individuals (e.g., individuals without cancer). This selection may be performed manually or computationally. In certain examples, a region is selected if it has at least about 5%, about 10%, about 15%, about 20%, about 30%, about 40%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, about 100%, or more than about 100% more methylation as compared to a sample from a healthy individual. In another example, a region may be selected if the number of reads in a disease sample that map to the region at or above a predefined threshold methylated CpG count exceeds the number of such reads for the same region in a sample from a healthy individual. For a given region, the methylated CpG count used as the baseline threshold in a healthy sample may vary, but reads mapped to that region that exceed the baseline threshold methylated CpG count for that region in a healthy sample may indicate an informative region despite fluctuations in the threshold CpG count.
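The read-level comparison described above can be sketched as follows. The sketch assumes that, for each read mapped to a region, the number of methylated CpGs has already been counted. It compares the fraction of reads at or above the threshold rather than the raw number, which is an assumption made here to account for differing sequencing depth; the function names and example counts are hypothetical.

```python
def fraction_at_or_above(methylated_cpg_counts, threshold):
    """Fraction of reads whose methylated CpG count is at least the threshold."""
    if not methylated_cpg_counts:
        return 0.0
    hits = sum(1 for count in methylated_cpg_counts if count >= threshold)
    return hits / len(methylated_cpg_counts)


def region_is_candidate(disease_counts, healthy_counts, threshold):
    """True when the disease sample has the larger fraction of heavily methylated reads."""
    return (fraction_at_or_above(disease_counts, threshold)
            > fraction_at_or_above(healthy_counts, threshold))


# Hypothetical per-read methylated CpG counts for one region.
disease = [5, 6, 0, 7, 1, 6]
healthy = [0, 1, 0, 2, 5, 0]
print(region_is_candidate(disease, healthy, threshold=5))  # True
```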
In some instances, the target region may be selected for amplification based on the number of samples in a validation set that have methylation at the site. For example, a region may be selected if the degree of methylation is higher, as compared to a sample from a healthy individual, in at least about 5%, about 10%, about 15%, about 20%, about 25%, about 30%, about 35%, about 40%, about 45%, about 50%, about 55%, about 60%, about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, about 96%, about 97%, about 98%, or about 99% of the tested samples from diseased individuals. For example, regions may be selected if they are methylated in at least about 75% of the tested tumors (including within a particular subtype). For some validation, tumor-derived cell lines can be used for testing.
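A minimal sketch of this sample-fraction criterion follows. It assumes one methylation level per sample for the region, and it uses the highest level observed among healthy samples as the baseline, which is a choice made here for illustration only; the function names and values are hypothetical.

```python
def fraction_hypermethylated(tumor_levels, healthy_levels):
    """Fraction of tumor samples whose methylation exceeds every healthy sample."""
    baseline = max(healthy_levels)  # assumed baseline: highest healthy level
    above = sum(1 for level in tumor_levels if level > baseline)
    return above / len(tumor_levels)


def select_region(tumor_levels, healthy_levels, min_fraction=0.75):
    """Select the region when enough tumor samples are more methylated than baseline."""
    return fraction_hypermethylated(tumor_levels, healthy_levels) >= min_fraction


# Hypothetical per-sample methylation levels for one region.
tumor = [0.8, 0.6, 0.7, 0.1]
healthy = [0.05, 0.1, 0.2]
print(fraction_hypermethylated(tumor, healthy))  # 0.75
print(select_region(tumor, healthy))             # True
```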
The present disclosure also provides a method for performing an assay to determine genetic and/or epigenetic parameters of one or more genes selected from the signature panels described herein and their promoters and regulatory elements. In some embodiments, the assay according to the following method is performed in order to detect methylation within one or more genes selected from signature panels described herein, wherein the methylated nucleic acid is present in a solution further comprising an excess of background DNA, wherein the background DNA is present at a concentration of about 100 to 1000-fold, about 100 to 10000-fold, about 100 to 100000-fold, about 1000 to 10000-fold, about 1000 to 100000-fold, or about 10000 to 100000-fold that of the DNA to be detected. In some embodiments, the background DNA is present at greater than about 100000 times the concentration of the DNA to be detected. In some embodiments, the methods comprise contacting a nucleic acid sample obtained from a subject with at least one reagent or a series of reagents (e.g., reagents that distinguish methylated from unmethylated CpG dinucleotides within a target nucleic acid).
The tumor or colonic cell proliferative disorder as described herein may be selected from: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colonic cell proliferative disorder comprises colorectal cancer.
Signature panels comprising informative methylated regions can be selected according to the purpose of the intended assay. For targeted methods, primer pairs can be designed based on the desired set of target regions. In some embodiments, the set of regions comprises at least one, at least two, at least three, or more than three regions listed in Table 1. In some embodiments, the set of regions comprises all of the regions listed in Table 1.
In some embodiments, the set of methylated regions associated with colorectal cancer is selected from Table 1.
In some embodiments, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, SFMBT2, CHST10, CCNA1, BEND4, KRBA1, S1PR1, PPP1R16B, IKZF, LONRF2, ZFP82, and FLT3 (e.g., where the tumor is colorectal cancer). In some embodiments, the cancer panel comprises all of the regions listed in table 1. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, SFMBT2, CHST10, CCNA1, BEND4, KRBA1, S1PR1, PPP1R16B, IKZF, LONRF2, ZFP82 and FLT3.
TABLE 1
(Table 1 is provided as an image in the original publication; its contents are not reproduced in this text.)
In some embodiments, the method further comprises quantifying a methylation signal, wherein a value that exceeds a predetermined threshold is indicative of a colon cell proliferative disorder. In some embodiments, the quantification and comparison of each methylation site in a proliferative disorder of colon cells is performed independently. Thus, a count of tumor positive signals can be established for each site. In some embodiments, the method further comprises determining a proportion of sequencing reads comprising tumor signal, wherein a proportion exceeding a threshold value is indicative of a colon cell proliferative disorder. In some embodiments, the determination of each methylation site in a colon cell proliferative disorder is performed independently.
As used herein, the term "threshold" generally refers to a value selected to distinguish, separate, or differentiate two populations of objects. In some embodiments, the threshold distinguishes methylation status as disease (e.g., malignant) status from non-disease (e.g., healthy) status. In some embodiments, the threshold may distinguish between different stages of the disease (e.g., stage 1, 2, 3, or 4). The threshold may be set based on the disease concerned and may be determined based on an earlier analysis, such as an analysis of a training set, or calculated based on a set of inputs having known characteristics (e.g., health, disease or disease stage). A threshold value can also be set for a gene region based on the predicted methylation at a particular site. The threshold for each methylation site can be different, and the data for multiple sites can be combined in the final analysis.
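The per-site quantification and thresholding described above can be illustrated with a short sketch. For each methylation site, it computes the proportion of sequencing reads carrying a tumor signal, compares that proportion with a site-specific threshold, and then combines the sites by counting positive calls. The function names, the read counts, and the threshold values are hypothetical; in practice a threshold would be derived from, for example, a training set.

```python
def tumor_read_fraction(tumor_signal_reads, total_reads):
    """Proportion of reads at a site that carry a tumor methylation signal."""
    return tumor_signal_reads / total_reads if total_reads else 0.0


def call_sites(read_counts, thresholds):
    """Evaluate each site independently against its own threshold.

    read_counts: site -> (tumor_signal_reads, total_reads)
    thresholds:  site -> minimum fraction treated as positive (hypothetical values)
    """
    return {
        site: tumor_read_fraction(*read_counts[site]) > thresholds[site]
        for site in read_counts
    }


# Hypothetical read counts and thresholds, for illustration only.
read_counts = {"ITGA4": (12, 1000), "EMBP1": (1, 1000), "TMEM163": (30, 2000)}
thresholds = {"ITGA4": 0.005, "EMBP1": 0.005, "TMEM163": 0.02}

calls = call_sites(read_counts, thresholds)
positive_sites = sum(calls.values())  # combine sites by counting positive calls
print(calls)           # {'ITGA4': True, 'EMBP1': False, 'TMEM163': False}
print(positive_sites)  # 1
```

Counting positive sites is only one way to combine per-site results; a weighted score or a trained classifier could be substituted.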
In some embodiments of the above methods, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: ITGA4, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B (e.g., wherein the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in table 2. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: ITGA4, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B.
TABLE 2
Methyl region (Gene ID; chromosome: start of position-end of position)
ITGA4;chr2:181457004-181457950
TMEM163;chr2:134718243-134719428
SFMBT2;chr10:7408046-7408953
ELMO1;chr7:37448612-37449471
ZNF543;chr19:57320164-57320845
SFMBT2;chr10:7410025-7411008
CHST10;chr2:100417269-100417795
ELMO1;chr7:37447852-37448217
CCNA1;chr13:36431498-36432414
BEND4;chr4:42150707-42153216
KRBA1;chr7:149714695-149715338
S1PR1;chr1:101236505-101237190
PPP1R16B;chr20:38805341-38807221
In some embodiments, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B (e.g., wherein the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in table 3. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1, S1PR1 and PPP1R16B.
TABLE 3
Methyl region (Gene ID; chromosome: start of position-end of position)
EMBP1;chr1:121519076-121519744
TMEM163;chr2:134718243-134719428
SFMBT2;chr10:7408046-7408953
ELMO1;chr7:37448612-37449471
ZNF543;chr19:57320164-57320845
SFMBT2;chr10:7410025-7411008
CHST10;chr2:100417269-100417795
ELMO1;chr7:37447852-37448217
CCNA1;chr13:36431498-36432414
BEND4;chr4:42150707-42153216
KRBA1;chr7:149714695-149715338
S1PR1;chr1:101236505-101237190
PPP1R16B;chr20:38805341-38807221
In some embodiments, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1 and S1PR1, and the tumor is colorectal cancer. In some embodiments, the cancer panel comprises one or more regions listed in table 4. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1, ZNF543, CHST10, CCNA1, BEND4, KRBA1 and S1PR1.
TABLE 4
Methyl region (Gene ID; chromosome: start of position-end of position)
ITGA4;chr2:181457004-181457950
EMBP1;chr1:121519076-121519744
TMEM163;chr2:134718243-134719428
SFMBT2;chr10:7408046-7408953
ELMO1;chr7:37448612-37449471
ZNF543;chr19:57320164-57320845
SFMBT2;chr10:7410025-7411008
CHST10;chr2:100417269-100417795
ELMO1;chr7:37447852-37448217
CCNA1;chr13:36431498-36432414
BEND4;chr4:42150707-42153216
KRBA1;chr7:149714695-149715338
S1PR1;chr1:101236505-101237190
In some embodiments, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1 and ZNF543, and the tumor is colorectal cancer. In some embodiments, the cancer panel comprises the regions listed in Table 5. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: ITGA4, EMBP1, TMEM163, SFMBT2, ELMO1 and ZNF543.
TABLE 5
Methyl region (Gene ID; chromosome: start of position-end of position)
ITGA4;chr2:181457004-181457950
EMBP1;chr1:121519076-121519744
TMEM163;chr2:134718243-134719428
SFMBT2;chr10:7408046-7408953
ELMO1;chr7:37448612-37449471
ZNF543;chr19:57320164-57320845
In some embodiments, the cancer panel comprises one or more of the regions ITGA4 and EMBP1 (e.g., wherein the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in Table 6. In some embodiments, the probes are directed to a sequence selected from one or more of ITGA4 and EMBP1.
TABLE 6
Methyl region (Gene ID; chromosome: start of position-end of position)
ITGA4;chr2:181457004-181457950
EMBP1;chr1:121519076-121519744
In some embodiments of the above methods, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B, ST3GAL1, ZEB2, NR3C1, ITGA4, GALNT14, CHST11, PPP1R16B, MGAT, ZNF264, BEND4, IRF4, LOC100130992, CHST11, CHST15, RASSF2, EMILIN2, TMEM163, CHST10, and HCK (e.g., where the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in Table 7. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B, ST3GAL1, ZEB2, NR3C1, ITGA4, GALNT14, CHST11, PPP1R16B, MGAT, ZNF264, BEND4, IRF4, LOC100130992, CHST11, CHST15, RASSF2, EMILIN2, TMEM163, CHST10 and HCK.
TABLE 7
(Table 7 is provided as an image in the original publication; its contents are not reproduced in this text.)
In some embodiments of the above methods, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B, ST3GAL1, ZEB2, NR3C1, ITGA4, GALNT14, CHST11, PPP1R16B, MGAT, ZNF264, BEND4, and IRF4 (e.g., where the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in Table 8. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B, ST3GAL1, ZEB2, NR3C1, ITGA4, GALNT14, CHST11, PPP1R16B, MGAT, ZNF264, BEND4, and IRF4.
TABLE 8
(Table 8 is provided as an image in the original publication; its contents are not reproduced in this text.)
In some embodiments of the above methods, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B and ST3GAL1 (e.g., where the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in table 9. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B and ST3GAL1.
TABLE 9
(Table 9 is provided as an image in the original publication; its contents are not reproduced in this text.)
In some embodiments of the above methods, the cancer panel comprises at least one, at least two, at least three, or more than three regions selected from: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB and FLI1 (e.g., wherein the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in table 10. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, at least three, or more than three of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB and FLI1.
TABLE 10
Methyl region (Gene ID; chromosome: start of position-end of position)
IKZF1;chr7:50303445-50305526
KCNQ5;chr6:72620772-72623556
ELMO1;chr7:37447220-37450201
CHST2;chr3:143118680-143121423
PRKCB;chr16:23835445-23837405
FLI1;chr11:128691887-128696541
In some embodiments of the above methods, the cancer panel comprises a region selected from at least one, at least two, or at least three of: IKZF1, KCNQ5 and ELMO1 (e.g., wherein the tumor is colorectal cancer). In some embodiments, the cancer panel comprises one or more regions listed in table 11. In some embodiments, the probes are directed to a sequence selected from at least one, at least two, or at least three of: IKZF1, KCNQ5 and ELMO1.
TABLE 11
Methyl region (Gene ID; chromosome: start of position-end of position)
IKZF1;chr7:50303445-50305526
KCNQ5;chr6:72620772-72623556
ELMO1;chr7:37447220-37450201
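For computational use, region entries in the "Gene ID; chromosome: start-end" format used in the tables above can be parsed into structured records. The following is a minimal sketch; the `Region` type and `parse_region` function are hypothetical helpers, and the span is computed simply as end minus start, without assuming a particular coordinate convention.

```python
from typing import NamedTuple


class Region(NamedTuple):
    gene: str
    chrom: str
    start: int
    end: int

    @property
    def span(self) -> int:
        """Difference between end and start coordinates."""
        return self.end - self.start


def parse_region(text: str) -> Region:
    """Parse an entry such as 'IKZF1;chr7:50303445-50305526'."""
    gene, locus = text.strip().split(";")
    chrom, coords = locus.split(":")
    start, end = coords.split("-")
    return Region(gene, chrom, int(start), int(end))


# Entries copied from Table 11.
table_11 = [
    "IKZF1;chr7:50303445-50305526",
    "KCNQ5;chr6:72620772-72623556",
    "ELMO1;chr7:37447220-37450201",
]
regions = [parse_region(entry) for entry in table_11]
for region in regions:
    print(region.gene, region.chrom, region.span)
```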
In one aspect, the present disclosure provides a method for identifying a methylation signature indicative of a biological trait, the method comprising: obtaining data for a population comprising a plurality of genomic methylation data sets associated with a colonic cell proliferative disorder state, each of the genomic methylation data sets being associated with biological information of a corresponding sample; separating the methylation data sets into a first grouping corresponding to one tissue or cell type having the biological trait and a second grouping corresponding to a plurality of tissue or cell types not having the biological trait; matching the methylation data of the first grouping to the methylation data of the second grouping site-by-site in the genome; identifying, site-by-site, a set of CpG sites in the genome that meet a predetermined threshold for establishing differential methylation between the first and second groupings; and using the set of CpG sites to identify a target genomic region comprising at least one, at least two, at least three, or more than three differentially methylated CpG sites within about 30 to 300 bp that meet the predetermined threshold, thereby identifying a differentially methylated genomic region and providing a methylation signature indicative of a biological trait associated with the presence of a colonic cell proliferative disorder.
In some examples, the target genomic region comprises at least one, at least two, at least three, or more than three differentially methylated CpG sites within a region having a length of: about 30 to 150bp, about 40 to 150bp, about 50 to 150bp, about 75 to 150bp, about 100 to 150bp, about 150 to 300bp, about 150 to 250bp, about 150 to 200bp, about 200 to 300bp, or about 250 to 300bp.
In some examples, the target genomic region comprises at least four differential methylated CpG sites, at least five differential methylated CpG sites, at least six differential methylated CpG sites, at least seven differential methylated CpG sites, at least eight differential methylated CpG sites, at least nine differential methylated CpG sites, at least ten differential methylated CpG sites, at least 12 differential methylated CpG sites, or at least 15 differential methylated CpG sites.
In some embodiments, the method further comprises validating the extended target genomic region by detecting differential methylation within the extended target genomic region using DNA from at least one independent sample possessing the biological trait and DNA from at least one independent sample not possessing the biological trait.
In some embodiments, the identifying further comprises limiting the set of CpG sites to CpG sites that further exhibit differential methylation compared to peripheral blood mononuclear cells from the reference or control sample.
In some embodiments, the predetermined threshold is at least about 50% methylation in the first grouping.
In some embodiments, the predetermined threshold is an average methylation difference between the first grouping and the second grouping of at least about 0.3.
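The site-by-site matching, thresholding, and region identification steps of this method can be sketched as follows. The sketch assumes that each grouping is represented as a mapping from CpG position (on one chromosome) to a list of per-sample methylation fractions, applies an average methylation difference of at least 0.3 as the threshold, and groups qualifying sites greedily into regions spanning at most 300 bp. The function names, the greedy grouping strategy, and the example data are assumptions for illustration, not the disclosed implementation.

```python
from statistics import mean


def differential_sites(first, second, min_diff=0.3):
    """Positions present in both groupings whose mean methylation differs enough.

    first, second: dict mapping CpG position -> list of methylation fractions.
    """
    shared = sorted(set(first) & set(second))  # site-by-site matching
    return [p for p in shared if mean(first[p]) - mean(second[p]) >= min_diff]


def target_regions(positions, min_sites=3, max_span=300):
    """Greedily group sorted sites into regions no longer than max_span."""
    regions, group = [], []
    for pos in sorted(positions):
        if group and pos - group[0] > max_span:
            if len(group) >= min_sites:
                regions.append((group[0], group[-1]))
            group = []
        group.append(pos)
    if len(group) >= min_sites:
        regions.append((group[0], group[-1]))
    return regions


# Hypothetical methylation fractions per CpG position.
first = {100: [0.8, 0.9], 150: [0.7, 0.9], 220: [0.9, 0.8],
         1000: [0.9, 0.9], 1500: [0.2, 0.3]}
second = {100: [0.1, 0.2], 150: [0.2, 0.2], 220: [0.1, 0.3],
          1000: [0.1, 0.1], 1500: [0.2, 0.2], 2000: [0.5]}

sites = differential_sites(first, second)
print(sites)                  # [100, 150, 220, 1000]
print(target_regions(sites))  # [(100, 220)]
```

The greedy grouping is a simplification; a sliding-window scan would consider every possible window and could recover regions that this approach splits.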
In some embodiments, the biological trait includes malignancy.
In some embodiments, the biological trait includes a type of cancer.
In some embodiments, the biological trait includes a stage of cancer.
In some embodiments, the biological trait comprises a cancer classification.
In some embodiments, the cancer classification comprises cancer staging.
In some embodiments, the cancer classification comprises a histological classification.
In some embodiments, the biological trait comprises a metabolic profile.
In some embodiments, the biological trait comprises a mutation.
In some embodiments, the mutation is a disease-associated mutation.
In some embodiments, the biological trait comprises a clinical outcome.
In some embodiments, the biological trait comprises a drug response.
In some embodiments, the method further comprises designing a plurality of PCR primer pairs to amplify portions of the extended target genomic region, each portion comprising at least one differentially methylated CpG site.
In some embodiments, the design of the plurality of primer pairs comprises converting unmethylated cytosines to uracils in silico to mimic bisulfite conversion, and designing the primer pairs using the converted sequences.
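A minimal sketch of such an in silico conversion is given below. It assumes a single (top) strand, writes uracil as "T" as it would be read after amplification, and treats all CpG cytosines as either methylated or unmethylated. The function name and the example sequence are hypothetical.

```python
def bisulfite_convert(seq, methylated_cpg=True):
    """Simulate bisulfite conversion of one DNA strand.

    Cytosines outside a CpG context are always converted to T.
    Cytosines in a CpG context are kept as C when methylated_cpg is True
    and converted to T otherwise.
    """
    seq = seq.upper()
    converted = []
    for i, base in enumerate(seq):
        if base == "C":
            in_cpg = i + 1 < len(seq) and seq[i + 1] == "G"
            converted.append("C" if (in_cpg and methylated_cpg) else "T")
        else:
            converted.append(base)
    return "".join(converted)

region = "ACCGTCGATCCA"  # made-up sequence
print(bisulfite_convert(region, methylated_cpg=True))   # ATCGTCGATTTA
print(bisulfite_convert(region, methylated_cpg=False))  # ATTGTTGATTTA
```

Primers could then be designed against either converted sequence, depending on the methylation specificity desired.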
In some embodiments, the primer pair is designed to be biased toward methylated sequences.
In some embodiments, the primer pair is methylation specific.
In some embodiments, the primer pair has no CpG residues and no preference for methylation status.
In one aspect, the present disclosure provides a method for synthesizing a primer pair specific for a methylation signature, the method comprising: performing a method of the disclosure and synthesizing the designed primer pairs.
Nucleic acid conversion and methylation sequencing
A. Nucleic acid processing
Methylation sequencing can utilize a variety of methods, including chemical and enzymatic conversion of nucleic acid bases, to distinguish methylated cytosines from unmethylated cytosines in a nucleic acid sequence. These assays allow the methylation status of one or more CpG dinucleotides (e.g., CpG islands) within a DNA sequence to be determined. Such assays may include, inter alia, DNA sequencing of bisulfite- or enzyme-treated DNA, polymerase chain reaction (PCR) for sequence-specific amplification, quantitative PCR (qPCR), digital droplet PCR (ddPCR), or Southern blot analysis. In various examples, DNA in a biological sample is treated in such a way that a cytosine base that is unmethylated at the 5-position is converted to uracil, thymine, or another base that differs from cytosine in hybridization behavior. This may be referred to as "conversion".
In some embodiments, the reagent converts a cytosine base that is unmethylated at the 5-position to uracil, thymine, or another base that differs from cytosine in hybridization behavior.
Bisulfite modification of DNA is a widely used tool for assessing CpG methylation status. A common method for analyzing DNA for the presence of 5-methylcytosine (5-mC) is based on the reaction of bisulfite with cytosine, which, after subsequent alkaline desulfonation, is converted to uracil; uracil corresponds to thymine in its base pairing behavior. For example, genomic sequencing has been adapted for analysis of DNA methylation patterns and 5-methylcytosine distribution by using bisulfite treatment (e.g., as described by Frommer et al., Proc. Natl. Acad. Sci. USA 89:1827-1831, 1992, the contents of which are incorporated herein by reference). Notably, however, 5-methylcytosine remains unmodified under these conditions. Thus, the original DNA is converted in such a way that methylcytosine (methyl-C), which was originally indistinguishable from cytosine by its hybridization behavior, can now be detected as the only remaining cytosine by various molecular biological techniques, for example by amplification and hybridization or by sequencing. In various examples, other reagents that achieve the same result as bisulfite modification are suitable for methylation sequencing.
One commonly used direct sequencing method employs PCR amplified bisulfite treated DNA, which is suitable for Whole Genome Bisulfite Sequencing (WGBS) or targeted bisulfite sequencing.
Targeted bisulfite sequencing is a commercially available NGS method for assessing site-specific DNA methylation changes. The probes are designed to be strand-specific and bisulfite-specific. Both methylated and unmethylated sequences are amplified. The process is similar to pyrosequencing but provides higher overall throughput. In some embodiments, a next generation sequencing platform is used to deliver large amounts of useful DNA methylation information (e.g., EpiGentek, Farmingdale, NY and Zymo Research, Irvine, CA). Single-base resolution methylation analysis of individual cytosines in DNA can be facilitated by bisulfite treatment of DNA, followed by PCR amplification of the target region, library construction, and sequencing of the amplicon region. Specific primers can be designed for the region of interest, and the change in cytosine methylation within that region assessed. Each target DNA methylation site can be assessed at high sequencing coverage depth to obtain accurate, quantitative, single-base resolution data.
Enzymatic methyl sequencing (EM-seq) relies on enzymatic conversion of nucleic acids for genomic analysis. Data suggest that generating an EM-seq library does not damage DNA the way bisulfite treatment does. EM-seq libraries can achieve higher PCR yields while using fewer PCR cycles across all DNA inputs, indicating that less DNA is lost during enzymatic processing and library preparation than in whole genome bisulfite sequencing (WGBS). In turn, fewer PCR cycles translate into more complex libraries and fewer PCR duplicates during sequencing. The average insert size of EM-seq libraries can also be larger than that of WGBS libraries, further supporting the conclusion that the DNA remains more intact. In the EM-seq procedure, TET2 oxidizes 5-mC and 5-hmC, which protects them from APOBEC deamination in the next step. In contrast, unmodified cytosine is deaminated to uracil. In some embodiments, the targeted methods comprise enzymatic conversion of nucleic acids (TEM-seq). In some embodiments, the methylation sequencing method is performed using Enzymatic Methyl-seq (New England Biolabs, Ipswich, MA), which is useful for the identification of 5mC and 5hmC.
In another example, 5hmC may also be detected using TET-assisted bisulfite sequencing (TAB-seq) (e.g., as described by Yu, M. et al. (2012) Nat. Protoc. 7, 2159-2170, the contents of which are incorporated herein by reference) (WiseGene). Fragmented DNA can be enzymatically modified by sequential treatment with T4 bacteriophage beta-glucosyltransferase (T4-BGT) and then a ten-eleven translocation (TET) dioxygenase prior to the addition of sodium bisulfite. T4-BGT glucosylates 5hmC to form beta-glucosyl-5-hydroxymethylcytosine (5ghmC), and TET then oxidizes 5mC to 5caC. Only 5ghmC is protected from subsequent deamination by sodium bisulfite, which allows 5hmC to be distinguished from 5mC by sequencing.
Oxidative bisulfite sequencing (oxBS) provides another method to distinguish 5mC from 5hmC (e.g., as described by Booth, M.J. et al., 2012, Science 336, 934-937, the contents of which are incorporated herein by reference). The oxidizing agent potassium perruthenate converts 5hmC to 5-formylcytosine (5fC), and subsequent treatment with sodium bisulfite deaminates 5fC to form uracil. 5mC remains unchanged and can therefore be identified using this method.
APOBEC-coupled epigenetic sequencing (ACE-seq) avoids bisulfite conversion entirely and relies on enzymatic conversion to detect 5hmC (e.g., as described by Schutsky, E.K. et al., Nat. Biotechnol., 2018 Oct 8, the contents of which are incorporated herein by reference). In this approach, T4-BGT glucosylates 5hmC to 5ghmC and protects it from deamination by apolipoprotein B mRNA editing enzyme catalytic subunit 3A (APOBEC3A). Cytosine and 5mC are deaminated by APOBEC3A and sequenced as thymine.
In another example, a bisulfite-free, base-resolution sequencing method, TET-assisted pyridine borane sequencing (TAPS), may be used for the detection of 5mC and 5hmC. TAPS combines ten-eleven translocation (TET) oxidation of 5mC and 5hmC to 5-carboxylcytosine (5caC) with pyridine borane reduction of 5caC to dihydrouracil (DHU). Subsequent PCR converts DHU to thymine, effecting a C-to-T transition at 5mC and 5hmC positions. TAPS detects modifications directly with high sensitivity and specificity without affecting unmodified cytosine (e.g., as described by Liu, Y. et al., Nat Biotechnol. 2019 Apr;37(4):424-429, the contents of which are incorporated herein by reference).
TET-assisted 5-methylcytosine sequencing (TAmC-seq) enriches 5mC loci using two sequential enzymatic reactions followed by affinity pull-down (e.g., as described by Zhang, L., 2013, Nat Commun 4). Fragmented DNA is treated with T4-BGT to protect 5hmC by glucosylation. 5mC is then oxidized to 5hmC using the mTET1 enzyme, and the newly formed 5hmC is labeled by T4-BGT with a modified glucose moiety (6-N3-glucose). Click chemistry is used to introduce a biotin tag, enabling enrichment of 5mC-containing DNA fragments for detection and genome-wide profiling.
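The conversion chemistries described so far differ in which cytosine states are read as "C" and which as "T" after amplification. The lookup table below is an illustrative summary of the descriptions above, assuming ideal (complete) conversion; the dictionary and function names are hypothetical.

```python
# Base read after conversion and PCR for each cytosine state,
# assuming ideal conversion efficiency for every method.
READOUT = {
    "bisulfite": {"C": "T", "5mC": "C", "5hmC": "C"},
    "EM-seq":    {"C": "T", "5mC": "C", "5hmC": "C"},
    "oxBS":      {"C": "T", "5mC": "C", "5hmC": "T"},
    "TAB-seq":   {"C": "T", "5mC": "T", "5hmC": "C"},
    "ACE-seq":   {"C": "T", "5mC": "T", "5hmC": "C"},
    "TAPS":      {"C": "C", "5mC": "T", "5hmC": "T"},
}

def distinguishable(method, state_a, state_b):
    """Return True if the method reads the two cytosine states differently."""
    return READOUT[method][state_a] != READOUT[method][state_b]

for method in READOUT:
    print(method, "separates 5mC from 5hmC:", distinguishable(method, "5mC", "5hmC"))
```

Read this way, bisulfite, EM-seq, and TAPS report 5mC and 5hmC together, whereas oxBS, TAB-seq, and ACE-seq separate them.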
B. Next generation sequencing
In some embodiments, the generation of sequencing reads is performed by next generation sequencing. This may allow a higher read depth for a given region. These can be high throughput methods including, for example, Illumina (Solexa) sequencing, DNB-Sequencer T7 or G400 (MGI Tech Co., Ltd.), sequencing on platforms from GenapSys, Inc., Roche 454 sequencing (Roche Sequencing Solutions, Inc.), Ion Torrent sequencing (Thermo Fisher Scientific), and SOLiD sequencing (Thermo Fisher Scientific). The number of sequencing reads can be adjusted based on the amount of DNA input and the depth of data required for analysis.
In some embodiments, the generation of sequencing reads is performed simultaneously on samples obtained from multiple patients, wherein each patient's cell-free nucleic acid fragments are barcoded. This allows for parallel analysis of multiple patients in one sequencing run.
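As a rough sketch of how barcoded reads from a pooled run might be assigned back to patients, the example below assumes a fixed-length barcode at the start of each read and exact barcode matching. Both are simplifying assumptions (many platforms read the index separately and tolerate mismatches), and all names and sequences are hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_patient, barcode_len=6):
    """Group reads by patient using a leading barcode of fixed length.

    Returns a dict mapping patient ID to the list of reads with the barcode
    trimmed off. Reads with an unknown barcode go to "undetermined".
    """
    by_patient = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        patient = barcode_to_patient.get(barcode, "undetermined")
        by_patient[patient].append(insert)
    return dict(by_patient)

barcodes = {"ACGTAC": "patient_1", "TTGCAA": "patient_2"}  # made-up barcodes
pooled = ["ACGTACTTTT", "TTGCAAGGGG", "NNNNNNCCCC"]
print(demultiplex(pooled, barcodes))
```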
In another aspect, the present disclosure provides a kit for detecting a tumor comprising reagents for performing the above method and instructions for detecting a tumor signal. The reagents may include, for example, primer sets, PCR reaction components, and/or sequencing reagents.
C. Targeted sequencing
In a targeted methylation sequencing method, targeted regions in a biological sample (such as cfDNA) are analyzed in order to determine the methylation status of a target gene sequence. In some embodiments, the target region includes, or hybridizes under stringent conditions to, contiguous nucleotides of a target region of interest (such as at least about 16 contiguous nucleotides of a target region of interest). In various examples, targeted sequencing can be achieved using hybrid capture and amplicon sequencing methods.
D. Hybrid Capture
The hybridization methods provided herein can be used for various forms of nucleic acid hybridization, such as in-solution hybridization and hybridization on solid supports (e.g., RNA, DNA, and in situ hybridization on membranes, microarrays, and cell/tissue slides). In particular, the methods are applicable to in-solution hybrid capture for target enrichment of certain types of genomic DNA sequences (e.g., exons) used in targeted next generation sequencing. For the hybrid capture method, a cell-free nucleic acid sample is subjected to library preparation. As used herein, "library preparation" includes end repair, A-tailing, adapter ligation, or any other preparation of cell-free DNA to allow for subsequent DNA sequencing. In certain examples, the prepared cell-free nucleic acid library sequences contain adapters, sequence tags, and index barcodes linked to the cell-free nucleic acid sample molecules. Various commercially available kits can be used to facilitate library preparation for next generation sequencing methods. Construction of next generation sequencing libraries can include the use of a series of coordinated enzymatic reactions to prepare nucleic acid targets as a collection of random DNA fragments of a particular size for high throughput sequencing. Advances in various library preparation techniques have expanded the application of next generation sequencing in fields such as transcriptomics and epigenetics.
Improvements in sequencing technology have led to variations and improvements in library preparation. Next generation sequencing library preparation kits developed by companies such as Bioo Scientific, Kapa Biosystems, New England Biolabs, Life Technologies, and Pacific Biosciences provide consistency and reproducibility for various molecular biology reactions and ensure compatibility with the latest NGS instrument technology.
In different examples of targeted capture gene panels, the library preparation kit may be selected from commercially available kits, including Nextera Flex and DNA Prep kits, Ion kits (Thermo Fisher), Agilent ClearSeq and capture kits, Bioo kits, and xGen kits.
In some embodiments, the hybrid capture method is performed using specific probes on the prepared library sequences. In some embodiments, the term "specific probe" as used herein generally refers to a probe specific for a known methylation site. In some embodiments, the design of specific probes uses the human genome as a reference sequence and specific genomic regions known to have methylation sites as target sequences. Specifically, genomic regions known to have methylation sites can include at least one of the following regions: promoter region, CpG island region, CGI shore region, and imprinted gene region. Thus, when hybridization capture is performed using the specific probes of some embodiments, sequences in the sample genome that are complementary to the target sequence, e.g., regions in the sample genome known to have methylation sites (also referred to herein as "specific genomic regions"), can be effectively captured.
According to one example, the methylated regions described herein are used to design specific probes. In some embodiments, specific probes are designed using commercially available tools (e.g., the eArray system). The probe is long enough to hybridize with sufficient specificity to the methylated region of interest. In various examples, the probe is a 10-mer, 11-mer, 12-mer, 13-mer, 14-mer, 15-mer, 16-mer, 17-mer, 18-mer, 19-mer, or 20-mer.
The regions listed in tables 1-11 above were screened using database resources such as Gene Ontology. According to the principle of complementary base pairing, a single-stranded capture probe binds a complementary single-stranded target sequence, thereby capturing the target region. In some embodiments, the designed probes can be implemented as solid-phase capture chips (where the probes are immobilized on a solid support) or as liquid-phase capture chips (where the probes are free in solution). Because solid-phase capture chips are limited by factors such as probe length, probe density, and high cost, they are used rarely, and liquid-phase capture chips are used more often.
In some embodiments, GC-rich sequences in nucleic acids (where the GC content is above 60%) may be captured less efficiently than typical sequences (where A, T, C, and G each account for about 25% of bases) because of the molecular structure of the C and G bases. For regions of particular interest, such as CpG islands (CGIs), it may be advisable to design a larger number of probes to obtain sufficient and accurate CGI data.
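The GC-content consideration above can be sketched as follows. The 60% cutoff comes from the text; the rule of doubling the number of probes for GC-rich regions is purely a hypothetical illustration, as are the function names and sequences.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def probes_for_region(seq, base_probes=1, gc_threshold=0.60, gc_rich_multiplier=2):
    """Hypothetical rule: assign more probes to regions above the GC threshold."""
    if gc_fraction(seq) > gc_threshold:
        return base_probes * gc_rich_multiplier
    return base_probes

print(gc_fraction("GGCCGGCCAT"), probes_for_region("GGCCGGCCAT"))  # 0.8 2
print(gc_fraction("ATATGCATAT"), probes_for_region("ATATGCATAT"))  # 0.2 1
```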
E. Amplicon-based sequencing
The converted DNA fragment can be amplified. In some embodiments, amplification is performed with primers designed to anneal to a methylation-converted target sequence having at least one methylation site therein. Methylation sequencing conversion converts unmethylated cytosine to uracil, while 5-methylcytosine is unaffected. Thus, a "converted target sequence" is understood to be the following sequence: one in which a cytosine known to be a methylation site is fixed as "C" (cytosine), while a cytosine known to be unmethylated is fixed as "U" (uracil; which can be treated as "T" (thymine) for primer design purposes).
In various examples, the source of DNA is cell-free DNA from whole blood, plasma, serum, or genomic DNA extracted from cells or tissues. In some embodiments, the amplified fragment is between about 100 and 200 base pairs in length. In some embodiments, the DNA source is extracted from a cellular source (e.g., tissue, biopsy, cell line) and the amplified fragment is between about 100 and 350 base pairs in length. In some embodiments, the amplified fragment comprises at least one 20 base pair sequence comprising at least one, at least two, at least three, or more than three CpG dinucleotides. Amplification can be performed using a set of primer oligonucleotides according to the present disclosure, and a thermostable polymerase can be used. Amplification of several DNA segments can be performed simultaneously in the same reaction vessel. In some embodiments, two or more fragments are amplified simultaneously. For example, amplification can be performed using the Polymerase Chain Reaction (PCR).
Primers designed to target these sequences may exhibit some degree of preference for converted methylated sequences. In some embodiments, PCR primers are designed to be methylation specific for targeted methylation sequencing applications. This may allow for higher sensitivity in some applications. For example, primers can be designed to contain discriminatory nucleotides (specific for methylated sequences after bisulfite conversion) that are positioned to achieve optimal discrimination (e.g., in PCR applications). The discriminatory nucleotide may be located at the 3' end or at the penultimate position.
In some embodiments, the primers are designed to amplify a DNA fragment of 75 to 350bp in length. This is a known general size range for circulating DNA, and according to this example, optimizing primer design to account for target size can improve the sensitivity of the method. The primers may be designed to amplify a region of about 50 to 200, about 75 to 150, or about 100 or 125bp in length.
In some embodiments of the methods described herein, the methylation status of preselected CpG positions in a nucleic acid sequence can be detected by amplicon-based methods using methylation-specific PCR (MSP) primer oligonucleotides. Amplification of bisulfite-treated DNA using methylation status specific primers allows discrimination between methylated and unmethylated nucleic acids. An MSP primer pair contains at least one primer that hybridizes to a bisulfite-converted CpG dinucleotide. Thus, the sequence of the primer comprises at least one CpG, TpG, or CpA dinucleotide. MSP primers specific for unmethylated DNA contain a "T" at the position of the C in the CpG. Thus, the base sequence of the primer may desirably comprise a sequence of at least 18 nucleotides in length that hybridizes to the pretreated nucleic acid sequence and its complement, wherein the base sequence of the oligomer comprises at least one CpG, TpG, or CpA dinucleotide. In some embodiments, the MSP primer comprises 2 to 5 CpG, TpG, or CpA dinucleotides. In some embodiments, the dinucleotide is located within the 3' half of the primer; e.g., for a primer 18 bases in length, the designated dinucleotide is located within the first 9 bases from the 3' end of the molecule. In addition to CpG, TpG, or CpA dinucleotides, primers may also comprise several bisulfite-converted bases (e.g., cytosine converted to thymine, or guanine converted to adenine on the hybridizing strand). In some embodiments, the primer is designed to contain no more than 2 cytosine or guanine bases.
In some embodiments, each region is amplified in segments using multiple primer pairs. In some embodiments, the segments do not overlap. These segments may be directly adjacent or spaced apart (e.g., up to 10, 20, 30, 40, or 50bp apart). Since target regions (including CpG islands, CpG shores, and/or CpG shelves) are typically longer than 75 to 150bp, this example allows assessment of methylation status at more (or all) sites across a given target region.
Primers can be designed for the target region using suitable tools such as Primer3, Primer3Plus, Primer-BLAST, and the like. As discussed, bisulfite conversion converts unmethylated cytosine to uracil, which is read as thymine after amplification, while 5-methylcytosine remains cytosine. Thus, primer positioning or targeting can make use of bisulfite-converted methylated sequences, depending on the degree of methylation specificity desired.
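The primer criteria in the preceding paragraph can be checked programmatically. The sketch below is only an illustration: it counts CpG, TpG, and CpA dinucleotides in a candidate primer written 5' to 3' and tests the length, count, and 3'-half criteria with default values taken from the text. The function name and the example primer are hypothetical.

```python
def msp_primer_report(primer, min_length=18, min_sites=2, max_sites=5):
    """Check a candidate primer (5'->3') against illustrative MSP criteria."""
    primer = primer.upper()
    n = len(primer)
    # Start positions of CpG, TpG, or CpA dinucleotides
    sites = [i for i in range(n - 1) if primer[i:i + 2] in ("CG", "TG", "CA")]
    half_start = n - n // 2  # index of the first base of the 3' half
    return {
        "length_ok": n >= min_length,
        "site_count": len(sites),
        "site_count_ok": min_sites <= len(sites) <= max_sites,
        "has_site_in_3prime_half": any(i >= half_start for i in sites),
    }

print(msp_primer_report("AGGTTTTAGGATCGTCGT"))  # made-up 18-mer with two CpGs
```

For the made-up 18-mer, both CpG dinucleotides fall within the last 9 bases, so all three criteria are met.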
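A simple way to picture this segmentation is to tile a target region into consecutive, non-overlapping amplicon segments with an optional gap between them. The sketch below assumes half-open coordinates and a fixed segment length, so the final segment may be shorter; the function name and coordinates are hypothetical.

```python
def tile_region(start, end, segment_len=120, gap=0):
    """Split the interval [start, end) into non-overlapping segments.

    Consecutive segments are separated by `gap` bases (0 means adjacent).
    """
    if segment_len <= 0 or gap < 0:
        raise ValueError("segment_len must be positive and gap non-negative")
    segments = []
    pos = start
    while pos < end:
        seg_end = min(pos + segment_len, end)
        segments.append((pos, seg_end))
        pos = seg_end + gap
    return segments

# A made-up 400 bp region covered by three 120 bp segments spaced 20 bp apart
print(tile_region(1000, 1400, segment_len=120, gap=20))
```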
The amplified target region is designed to have at least 10 CpG dinucleotide methylation sites. However, in some instances, it may be advantageous to amplify regions having more than 10 CpG methylation sites. For example, a sequence read 300bp long may have about 10, 20, 30, 40, or 50 CpG methylation sites that are methylated in a nucleic acid sample associated with a colonic cell proliferative disorder. In various examples, the methylated regions identified in tables 1-11 can have at least 25, 50, 100, 200, 300, 400, or 500 CpG methylation sites that are methylated in a nucleic acid sample associated with a colonic cell proliferative disorder. In some embodiments, the primers are designed to amplify a DNA fragment comprising 3 to 20 CpG methylation sites in the targeted region. Overall, this approach allows more methylation sites to be queried in a single sequencing read and provides additional certainty (excluding false positives), since multiple concordant methylation events may be detected in a single sequencing read. In some embodiments, the tumor signal comprises more than two methylated regions selected from tables 1-11. In this example, detecting multiple tumor signals may increase the confidence of the tumor detection. Such signals may be at the same site or at different sites. In some embodiments, detection of more than one tumor signal at the same region is indicative of a tumor.
In some embodiments, the number of CpG sites in the identified methylation region can be modeled between two populations having distinct characteristics of a colon cell proliferative disorder to identify a methylation threshold, wherein a number of CpG sites in one region exceeding the threshold is indicative of a colon cell proliferative disorder.
In various examples, the number of CpG sites in the identified methylated regions that indicate colorectal cancer is at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 or 18, wherein the presence of methylated CpG, if exceeding this identified number, is indicative of colorectal cancer and can be used as an input feature for a machine learning model that serves as a classifier for stratifying the population into healthy and colorectal cancer individuals.
In this example, detection of multiple tumor signals indicative of methylation at the same site in the genome can increase the confidence of tumor detection. Detection of methylation at adjacent sites in the genome can improve the confidence of tumor detection even if the signals are from different sequencing reads. This reflects another type of signal consistency. In some embodiments, detection of adjacent or overlapping tumor signals in at least two different sequencing reads is indicative of a tumor. In some embodiments, adjacent or overlapping tumor signals are within the same CpG island. In some embodiments, detection of 3 to 34 proximal methylation sites in the cell-free DNA fragment is indicative of a tumor. In some embodiments, detection of 3 to 34 methylated CpG sites in a fragment is used to identify a threshold to distinguish a population of individuals with a trait (e.g., health, disease or stage of disease). In some embodiments, detection of about 4 to 10, about 4 to 15, about 10 to 20, about 15 to 25, about 20 to 34, about 25 to 34, or about 30 to 34 methylated proximal CpG sites in a read fragment is used to determine a threshold to distinguish a population of individuals having a trait (e.g., health, disease, or stage of disease). As used herein, the term "proximal CpG site" refers to CpG sites that are adjacent or between 2 to 10 CpG sites on the same nucleic acid fragment in a cell-free nucleic acid sample.
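The idea of requiring several methylated CpG sites on a fragment, and concordant signal in at least two reads, can be sketched as below. Each fragment is represented as a string with one character per CpG site ("1" methylated, "0" unmethylated). This encoding, the function names, and the default thresholds are assumptions chosen for illustration, not the disclosed classifier.

```python
def count_signal_fragments(fragments, min_methylated=3):
    """Count fragments carrying at least `min_methylated` methylated CpGs."""
    return sum(1 for calls in fragments if calls.count("1") >= min_methylated)

def region_is_positive(fragments, min_methylated=3, min_fragments=2):
    """Call a region positive when enough independent fragments carry signal."""
    return count_signal_fragments(fragments, min_methylated) >= min_fragments

reads = ["1110", "0000", "1011", "0100"]  # made-up per-CpG calls for four reads
print(count_signal_fragments(reads))  # 2
print(region_is_positive(reads))      # True
```

The per-fragment count could equally be passed on as a numeric input feature rather than thresholded directly.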
In some embodiments, amplification is performed using more than 100 primer pairs. Amplification can be performed using about 10, about 20, about 30, about 40, about 50, about 60, about 70, about 80, about 90, about 100, about 110, about 120, about 130, about 140, about 150, or more primer pairs. In some embodiments, the amplification is multiplex amplification. Multiplex amplification allows large amounts of methylation information to be collected in parallel from many target regions of the genome, even from cfDNA samples in which DNA is usually not abundant. Multiplexing can be scaled up to a platform such as Ion AmpliSeq, where up to about 24,000 amplicons can be queried simultaneously. In some embodiments, the amplification is nested amplification. Nested amplification can improve sensitivity and specificity.
Furthermore, another fast and robust protocol for examining multiple methylated sequences in parallel is known as simultaneous targeted methylation sequencing (sTM-Seq). Key features of this technique include eliminating the need for large amounts of high molecular weight DNA and nucleotide-specific discrimination of 5-methylcytosine (5mC) from 5-hydroxymethylcytosine (5hmC). Furthermore, sTM-Seq is scalable and can be used to investigate multiple loci in dozens of samples in one sequencing run. Freely available web-based software and universal primers for multiplexed barcoding, library preparation, and custom sequencing make sTM-Seq affordable, efficient, and widely applicable (e.g., as described by Asmus, N. et al., Curr Protoc Hum Genet. 2019 Apr;101(1), the contents of which are incorporated herein by reference).
In general, the methods and systems provided herein are useful for preparing cell-free polynucleotide sequences for downstream sequencing reactions. In some embodiments, the sequencing method is classical Sanger sequencing. Sequencing methods may include, but are not limited to: high-throughput sequencing, pyrosequencing, sequencing by synthesis, single molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing by ligation, sequencing by hybridization, RNA-Seq, Digital Gene Expression, next generation sequencing, single molecule sequencing by synthesis (SMSS), massively parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, primer walking, and any other sequencing method.
Pyrosequencing is a real-time sequencing technique based on luminometric detection of pyrophosphate release upon nucleotide incorporation, and is suitable for simultaneous analysis and quantification of the degree of methylation at several CpG positions. After conversion of genomic DNA, the target region is amplified using polymerase chain reaction (PCR), in which one of the two primers is biotinylated. The PCR-generated template is rendered single-stranded, and a pyrosequencing primer is annealed to quantify the CpG positions. After bisulfite treatment and PCR, the degree of methylation at each CpG position in the sequence is determined from the ratio of the T and C signals, which reflects the proportion of unmethylated and methylated cytosines at each CpG position in the original sequence.
V. Classifiers, machine learning models, and systems
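As an illustration of the T-to-C signal ratio, percent methylation at a CpG position can be computed as the C signal divided by the sum of the C and T signals. The sketch below assumes the two signals are already background-corrected peak heights in the same arbitrary units; the function name and values are hypothetical.

```python
def percent_methylation(c_signal, t_signal):
    """Percent methylation at one CpG position from C and T signal intensities."""
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this CpG position")
    return 100.0 * c_signal / total

# Made-up peak heights for three CpG positions: (C signal, T signal)
for c, t in [(30, 10), (5, 45), (20, 20)]:
    print(percent_methylation(c, t))  # 75.0, 10.0, 50.0
```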
In various examples, the methylation sequencing features are used as an input dataset to a trained algorithm (e.g., a machine learning model or classifier) to find correlations between sequence compositions and patient groupings. Examples of such patient groupings include the presence, stage, subtype, responders and non-responders, and progressors and non-progressors of the disease or condition. In various examples, feature matrices are generated to compare samples obtained from individuals with known conditions or characteristics. In some embodiments, the sample is obtained from a healthy individual or an individual without any known indications and the sample is obtained from a patient known to have cancer.
As used herein, with respect to machine learning and pattern recognition, the term "feature" generally refers to an individual measurable property or characteristic of an observed phenomenon. The concept of "features" is related to the concept of explanatory variables used in statistical techniques, such as, but not limited to, linear regression and logistic regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition.
As used herein, the term "input features" (or "features") generally refers to variables, such as conditions, sequence content (e.g., mutations), suggested data collection operations, or suggested processes, that are used by a trained algorithm (e.g., a model or classifier) to predict an output classification (label) of a sample. The value of the variable can be determined for a sample and used to determine the classification.
In various examples, the input features of the genetic data include: alignment variables associated with the alignment of sequence data (e.g., sequence reads) to a genome, and non-alignment variables, e.g., associated with the sequence content of a sequence read, a measurement of a protein or autoantibody, or an average methylation level of a genomic region. The input features may be genetic features such as V-plot measures, FREE-C deconvolution, chromatin accessibility, and cfDNA measurements at transcription start sites. Metrics that may be used in methylation analysis include, but are not limited to: base-by-base percent methylation of CpG, CHG, and CHH; conversion efficiency (100 minus the average percent methylation of CHH); hypomethylated blocks; methylation levels (overall average methylation of CpG, CHH, and CHG; fragment length; fragment midpoint; and methylation level in one or more genomic regions such as chrM, LINE1, or ALU); number of methylated CpGs per fragment; fraction of methylated CpGs over total CpGs per fragment; fraction of methylated CpGs per region in the panel; fraction of total CpGs in the panel; dinucleotide coverage (normalized dinucleotide coverage); uniformity of coverage (unique CpG sites at 1x and 10x average genomic coverage (for S4 runs)); overall average CpG coverage (depth); and average coverage at CpG islands, CGI shelves, and CGI shores. These metrics may be used as feature inputs for machine learning methods and models.
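Two of the metrics listed above, percent methylation per sequence context and conversion efficiency, can be derived from simple call counts. The following sketch assumes counts of methylated and unmethylated cytosine calls have already been tallied per context; the function names and numbers are hypothetical.

```python
def methylation_percent(methylated, unmethylated):
    """Percent of cytosine calls that are methylated (0.0 if no calls)."""
    total = methylated + unmethylated
    return 100.0 * methylated / total if total else 0.0

def summarize(counts):
    """counts maps 'CpG', 'CHG', 'CHH' to (methylated_calls, unmethylated_calls)."""
    metrics = {ctx: methylation_percent(*counts[ctx]) for ctx in ("CpG", "CHG", "CHH")}
    # Conversion efficiency as defined in the text: 100 minus CHH methylation
    metrics["conversion_efficiency"] = 100.0 - metrics["CHH"]
    return metrics

# Made-up call counts for one sample
sample_counts = {"CpG": (750, 250), "CHG": (5, 995), "CHH": (10, 1990)}
print(summarize(sample_counts))
```

The resulting dictionary could serve as part of a per-sample feature vector.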
For multiple assays, the system identifies a feature set for input into a trained algorithm (e.g., a machine learning model or classifier). The system analyzes each molecular category and forms a feature vector from the measurements. The system inputs the feature vectors into a machine learning model and obtains an output classification of whether the biological sample has the specified characteristics.
In some embodiments, the machine learning model outputs a classifier that is capable of distinguishing between two or more groupings or categories of individuals or features in a population of individuals or features of a population. In some embodiments, the classifier is a trained machine learning classifier.
In some embodiments, informative loci or features of biomarkers in tumor tissue are analyzed to form a profile. A Receiver Operating Characteristic (ROC) curve can be generated by plotting the performance of a particular feature (e.g., any biomarker and/or any additional item of biomedical information described herein) in distinguishing between two populations (e.g., an individual who is responsive to a therapeutic agent and an individual who is non-responsive). In some embodiments, the feature data across the entire population (e.g., cases and controls) is sorted in ascending order based on a single feature value.
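As a minimal sketch of how such a curve can be derived from a single feature, the code below sorts the samples in ascending order of the feature value and sweeps a threshold across them. It assumes that a higher value indicates the case group (label 1) and that labels are 0 or 1; both are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def roc_curve(values: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points for one feature."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one case and one control")
    ordered = sorted(zip(values, labels))  # ascending by feature value
    tp, fp = positives, negatives  # threshold below every value: all called positive
    points = [(1.0, 1.0)]
    i = 0
    while i < len(ordered):
        value = ordered[i][0]
        # Raise the threshold past this value; tied samples move together.
        while i < len(ordered) and ordered[i][0] == value:
            if ordered[i][1] == 1:
                tp -= 1
            else:
                fp -= 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

An area near 1 indicates a feature that separates the two populations well, while an area near 0.5 indicates little discriminating power.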
In various examples, the specified characteristic is selected from the group consisting of healthy versus cancer, disease subtype, disease stage, progressor versus non-progressor, and responder versus non-responder.
In some embodiments, the colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
A. Data analysis
In some examples, the present disclosure provides a system, method, or kit in which data analysis may be implemented in a software application, computing hardware, or both. In various examples, an analysis application or system includes at least one data receiving module, a data preprocessing module, a data analysis module (which can operate on one or more types of genomic data), a data interpretation module, or a data visualization module. In some embodiments, the data receiving module may include a computer system that connects laboratory hardware or instruments with a computer system that processes laboratory data. In some embodiments, the data preprocessing module may include a hardware system or computer software that performs operations on the data in preparation for analysis. Examples of operations that may be applied to the data in the preprocessing module include affine transformation, de-noising operations, data cleansing, reformatting, or sub-sampling. The data analysis module can be dedicated to analyzing genomic data from one or more genomic materials; for example, it can take assembled genomic sequences and subject them to probabilistic and statistical analysis to identify abnormal patterns associated with a disease, pathology, condition, risk, or phenotype. The data interpretation module may use analytical methods, e.g., drawn from statistics, mathematics, or biology, to support understanding of the relationship between the identified abnormal patterns and the health condition, functional state, prognosis, or risk. The data visualization module may use mathematical modeling, computer graphics, or rendering methods to create a visual representation of the data that may facilitate understanding or interpretation of the results.
In various examples, a machine learning method is applied to distinguish samples in a population of samples. In some embodiments, machine learning methods are applied to differentiate healthy from advanced disease (e.g., adenoma) samples.
In some embodiments, the one or more machine learning operations used to train the prediction engine include one or more of: generalized linear models, generalized additive models, nonparametric regression operations, random forest classifiers, spatial regression operations, Bayesian regression models, time series analysis, Bayesian networks, Gaussian networks, decision tree learning operations, artificial neural networks, recurrent neural networks, convolutional neural networks, reinforcement learning operations, linear or nonlinear regression operations, support vector machines, clustering operations, and genetic algorithm operations.
In various examples, the computer processing method is selected from the group consisting of logistic regression, multiple linear regression (MLR), dimensionality reduction, partial least squares (PLS) regression, principal component regression, autoencoders, variational autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machines, decision trees, classification and regression trees (CART), tree-based methods, random forests, gradient-boosted trees, matrix decomposition, multidimensional scaling (MDS), t-distributed stochastic neighbor embedding (t-SNE), multilayer perceptron (MLP), network clustering, neuro-fuzzy methods, and artificial neural networks.
In some examples, the methods disclosed herein can include computational analysis of nucleic acid sequencing data from samples of an individual or multiple individuals.
B. Classifier generation
In one aspect, the disclosed systems and methods provide a classifier that is generated based on feature information derived from methylation sequence analysis of cfDNA biological samples. The classifier forms part of a prediction engine for distinguishing groups in a population according to sequence features identified in a biological sample (such as cfDNA).
In some embodiments, the classifier is created by: formatting the similar part of the sequence information into a uniform format and a uniform scale to normalize the sequence information; storing the normalized sequence information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized sequence information, the prediction engine mapping combinations of one or more features for a particular population; applying a prediction engine to the accessed field information to identify individuals associated with the group; and dividing individuals into groups.
In some embodiments, the hierarchy is created by: formatting the similar part of the sequence information into a uniform format and a uniform scale to normalize the sequence information; storing the normalized sequence information in a columnar database; training a prediction engine by applying one or more machine learning operations to the stored normalized sequence information, the prediction engine mapping combinations of one or more features for a particular population; applying a prediction engine to the accessed field information to identify individuals associated with the group; and dividing individuals into groups.
Specificity as used herein generally refers to the "probability that a test is negative in an individual who is not afflicted". It can be calculated by dividing the number of disease-free individuals for which the test result is negative by the total number of disease-free individuals.
In various examples, the model, classifier, or predictive test has the following specificity: at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
Sensitivity, as used herein, generally refers to "the probability that a test result is positive in an individual with a disease". It can be calculated by dividing the number of diseased individuals who test positive by the total number of diseased individuals.
In various examples, the model, classifier, or predictive test has the following sensitivities: at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
A positive predictive value as used herein generally refers to the "probability that a positive test result is correct". It can be calculated by dividing the number of true positive test results by the total number of positive test results.
In various examples, the model, classifier, or predictive test has the following positive predictive value: at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
Negative predictive value as used herein generally refers to the "probability that a negative test result is correct". It can be calculated by dividing the number of true negative tests by the total number of negative tests.
In various examples, the model, classifier, or predictive test has the following negative predictive value: at least 40%, at least 45%, at least 50%, at least 55%, at least 60%, at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99%.
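The four performance measures defined above follow directly from the counts of true positives, false positives, true negatives, and false negatives. A minimal sketch, with illustrative function and key names:

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, PPV, and NPV from confusion-matrix counts."""
    return {
        # Diseased individuals who test positive / all diseased individuals.
        "sensitivity": _ratio(tp, tp + fn),
        # Disease-free individuals who test negative / all disease-free individuals.
        "specificity": _ratio(tn, tn + fp),
        # True positive results / all positive results.
        "ppv": _ratio(tp, tp + fp),
        # True negative results / all negative results.
        "npv": _ratio(tn, tn + fn),
    }
```

Note that sensitivity and specificity are properties of the test, whereas the predictive values also depend on how common the disease is in the tested population.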
C. Digital processing device
In some examples, the subject matter described herein may include a digital processing device or a use thereof. In some examples, the digital processing device may include one or more hardware Central Processing Units (CPUs), Graphics Processing Units (GPUs), or Tensor Processing Units (TPUs) that perform the functions of the device. In some examples, the digital processing device may include an operating system configured to execute executable instructions.
In some examples, the digital processing device is optionally connected to a computer network. In some examples, the digital processing device is optionally connected to the internet. In some examples, the digital processing device is optionally connected to a cloud computing facility. In some instances, the digital processing device is optionally connected to an intranet. In some examples, the digital processing device is optionally connected to a data storage device.
Non-limiting examples of suitable digital processing devices include server computers, desktop computers, notebook computers, sub-notebook computers, netbook computers, netpad computers, set-top box computers, handheld computers, internet appliances, mobile smartphones, and tablet computers. Suitable tablet computers may include, for example, those having booklet, slate, and convertible configurations.
In some examples, the digital processing device may include an operating system configured to execute executable instructions. For example, an operating system may include software, including programs and data, for managing the hardware of the device and providing services for the execution of applications. Non-limiting examples of operating systems include Ubuntu, FreeBSD, OpenBSD, Linux, Mac OS X, and Windows. Non-limiting examples of suitable personal computer operating systems include Mac OS and UNIX-like operating systems.
In some instances, the operating system may be provided by cloud computing, and the cloud computing resources may be provided by one or more service providers.
In some examples, a device may include storage and/or memory devices. The storage and/or memory device may be one or more physical devices for temporarily or permanently storing data or programs. In some instances, the device may be a volatile memory and require power to maintain the stored information. In some examples, the device is a non-volatile memory and retains stored information when the digital processing device is not powered. In some examples, the non-volatile memory may include flash memory. In some examples, the volatile memory may include Dynamic Random Access Memory (DRAM). In some examples, the non-volatile memory may include Ferroelectric Random Access Memory (FRAM). In some examples, the non-volatile memory may include phase change random access memory (PRAM).
In some examples, the device may be a storage device including, for example, a CD-ROM, a DVD, a flash memory device, a disk drive, a tape drive, an optical disk drive, and cloud-based storage. In some examples, the storage and/or memory devices may be a combination of devices such as those disclosed herein. In some examples, the digital processing device may include a display that sends visual information to the user. In some examples, the display may be a Cathode Ray Tube (CRT). In some examples, the display may be a Liquid Crystal Display (LCD). In some examples, the display may be a thin film transistor liquid crystal display (TFT-LCD). In some examples, the display may be an Organic Light Emitting Diode (OLED) display. In some examples, the OLED display may be a Passive Matrix OLED (PMOLED) or Active Matrix OLED (AMOLED) display. In some examples, the display may be a plasma display. In some examples, the display may be a video projector. In some examples, the display may be a combination of devices such as those disclosed herein.
In some examples, the digital processing device may include an input device that receives information from a user. In some examples, the input device may be a keyboard. In some examples, the input device may be a pointing device, including, for example, a mouse, trackball, trackpad, joystick, game controller, or stylus. In some examples, the input device may be a touch screen or a multi-touch screen. In some examples, the input device may be a microphone for capturing voice or other sound input. In some examples, the input device may be a camera for capturing motion or visual input. In some examples, the input device may be a combination of devices such as those disclosed herein.
D. Non-transitory computer readable storage medium
In some examples, the subject matter disclosed herein may include one or more non-transitory computer-readable storage media encoded with a program containing instructions executable by an operating system of an optionally networked digital processing device. In some examples, the computer-readable storage medium may be a tangible component of a digital processing device. In some examples, the computer-readable storage medium is optionally removable from the digital processing device. In some examples, the computer-readable storage medium may include, for example, CD-ROMs, DVDs, flash memory devices, solid state memory, disk drives, tape drives, optical drives, cloud computing systems and services, and the like. In some examples, programs and instructions may be encoded on the media permanently, substantially permanently, semi-permanently, or non-transitorily.
E. Computer system
The present disclosure provides a computer system programmed to implement the methods described herein. Fig. 1 illustrates a computer system 101 programmed or otherwise configured to store, process, identify or interpret patient data, biological sequences, and reference sequences. The computer system 101 may process various aspects of the patient data, biological sequence, or reference sequence of the present disclosure. Computer system 101 may be a user's electronic device or a computer system located remotely from the electronic device. The electronic device may be a mobile electronic device.
Computer system 101 includes a central processing unit (CPU, also referred to herein as a "processor" and "computer processor") 105, which may be a single-core or multi-core processor, or multiple processors for parallel processing. Computer system 101 also includes a memory or storage location 110 (e.g., random access memory, read only memory, flash memory), an electronic storage unit 115 (e.g., hard disk), a communication interface 120 (e.g., a network adapter) for communicating with one or more other systems, and peripheral devices 125, such as a cache, other memory, data storage, and/or an electronic display adapter. Memory 110, storage unit 115, interface 120, and peripheral devices 125 communicate with CPU 105 through a communication bus (solid line), such as a motherboard. The storage unit 115 may be a data storage unit (or data repository) for storing data. Computer system 101 may be operatively coupled to a computer network ("network") 130 by way of the communication interface 120. The network 130 may be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. In some examples, the network 130 is a telecommunications and/or data network. The network 130 may include one or more computer servers, which may enable distributed computing, such as cloud computing. In some instances, network 130 may implement a peer-to-peer network with the aid of computer system 101, which may enable devices coupled to computer system 101 to behave as clients or servers.
CPU 105 may execute a series of machine-readable instructions, which may be embodied in a program or software. The instructions may be stored in a memory location, such as memory 110. The instructions may be directed to the CPU 105, which may then program or otherwise configure the CPU 105 to implement the methods of the present disclosure. Examples of operations performed by CPU 105 may include fetch, decode, execute, and write-back.
CPU 105 may be part of a circuit, such as an integrated circuit. One or more other components of system 101 may be included in a circuit. In some examples, the circuit is an Application Specific Integrated Circuit (ASIC).
The storage unit 115 may store files such as drivers, libraries, and saved programs. The storage unit 115 may store user data, such as user preferences and user programs. In some instances, computer system 101 may include one or more additional data storage units external to computer system 101, such as on a remote server in communication with computer system 101 over an intranet or the internet.
Computer system 101 may communicate with one or more remote computer systems over network 130. For example, computer system 101 may communicate with a remote computer system of a user. Examples of remote computer systems include a personal computer (e.g., a laptop PC), a tablet PC (e.g., iPad, Galaxy Tab), a telephone, a smartphone (e.g., iPhone, Android-enabled device), or a personal digital assistant. A user may access computer system 101 via network 130.
The methods as described herein may be implemented by machine (e.g., computer processor) executable code stored on an electronic storage location of computer system 101 (e.g., such as, for example, on memory 110 or electronic storage unit 115). The machine executable or machine readable code may be provided in the form of software. During use, the code may be executed by the processor 105. In some examples, code may be retrieved from storage unit 115 and stored on memory 110 for access by processor 105. In some instances, the electronic storage unit 115 may be eliminated, and the machine executable instructions stored on the memory 110.
The code may be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or may be interpreted or compiled at runtime. The code may be provided in a programming language that may be selected to enable the code to be executed in a pre-compiled, interpreted or compiled form.
Aspects of the systems and methods provided herein, such as computer system 101, may be embodied in programming. Various aspects of the described technology may be considered a "product" or "article of manufacture", typically in the form of machine (or processor) executable code and/or associated data, carried or embodied in a type of machine-readable medium. The machine executable code may be stored on an electronic storage unit such as a memory (e.g., read only memory, random access memory, flash memory) or a hard disk. A "storage" type medium may include any or all of the tangible memory of a computer, processor, etc., or associated modules thereof, such as various semiconductor memories, tape drives, disk drives, etc., that may provide non-transitory storage for software programming at any time. All or part of the software may at times be communicated over the internet or various other telecommunications networks. For example, such communication may enable loading of software from one computer or processor into another computer or processor, e.g., from a management server or host computer into the computer platform of an application server. Thus, another type of medium that can carry software elements includes optical, electrical, and electromagnetic waves, such as those used across physical interfaces between local devices, through wired and optical landline networks, and over various air links. The physical elements that carry such waves, such as wired or wireless links, optical links, etc., may also be considered media that carry software. As used herein, unless limited to a non-transitory, tangible "storage" medium, terms such as a computer or machine "readable medium" refer to any medium that participates in providing instructions to a processor for execution.
Thus, a machine-readable medium, such as one carrying computer executable code, may take many forms, including but not limited to tangible storage media, carrier wave media, or physical transmission media. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any one or more computers or the like, such as may be used to implement the databases and the like shown in the figures. Volatile storage media include dynamic memory, such as the main memory of such computer platforms. Tangible transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media can take the form of electrical or electromagnetic signals, or acoustic or light waves, such as those generated during Radio Frequency (RF) and Infrared (IR) data communications. Thus, common forms of computer-readable media include, for example: a floppy disk, a flexible disk, a hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
The computer system 101 may include or be in communication with an electronic display 135, the electronic display 135 including a User Interface (UI) 140 for providing, for example, nucleic acid sequences, concentrated nucleic acid samples, methylation profiles, expression profiles, and analysis of methylation or expression profiles. Examples of UIs include, but are not limited to, Graphical User Interfaces (GUIs) and web-based user interfaces.
The methods and systems of the present disclosure may be implemented by one or more algorithms. The algorithms may be implemented in software when executed by the central processing unit 105. For example, the algorithm can store, process, identify, or interpret patient data, biological sequences, and reference sequences.
While certain examples of the methods and systems have been shown and described herein, those of skill will realize that these are provided by way of example only and are not intended to be limiting in the specification. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the scope of the invention as described herein. Moreover, it is to be understood that all aspects of the methods and systems are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables, and that the depictions are intended to encompass such alternatives, modifications, variations or equivalents.
In some examples, the subject matter disclosed herein may include at least one computer program or use thereof. The computer program may be a sequence of instructions executed in a CPU, GPU, or TPU of the digital processing device, written to perform specified tasks. Computer-readable instructions may be implemented as program modules, such as functions, objects, Application Programming Interfaces (APIs), data structures, and so forth, that perform particular tasks or implement particular abstract data types. In view of the disclosure provided herein, computer programs can be written in various versions of various languages.
The functionality of the computer-readable instructions may be combined or distributed as desired in various environments. In some instances, a computer program may include a sequence of instructions. In some instances, a computer program may include multiple sequences of instructions. In some instances, the computer program may be provided from one location. In some instances, the computer program may be provided from multiple locations. In some instances, a computer program may include one or more software modules. In some instances, the computer program may include, in part or in whole, one or more web applications, one or more mobile applications, one or more standalone applications, one or more web browser plug-ins, extensions, add-ins, or add-ons, or a combination thereof.
In some examples, the computer processing may be a method of statistics, mathematics, biology, or any combination thereof. In some examples, the computer processing methods include dimensionality reduction methods, including, for example, logistic regression, dimensionality reduction, principal component analysis, autoencoders, singular value decomposition, Fourier bases, wavelets, discriminant analysis, support vector machines, tree-based methods, random forests, gradient-boosted trees, matrix decomposition, network clustering, and neural networks.
In some examples, the computer processing method is a supervised machine learning method, including, for example, regression, support vector machine, tree-based methods, and networks.
In some examples, the computer processing method is an unsupervised machine learning method including, for example, clustering, networking, principal component analysis, and matrix decomposition.
F. Databases
In some examples, the subject matter disclosed herein can include one or more databases, or uses of the databases to store patient data, biological sequences, or reference sequences. The reference sequence may be derived from a database. In view of the disclosure provided herein, many databases may be suitable for storing and retrieving sequence information. In some instances, suitable databases may include, for example, relational databases, non-relational databases, object-oriented databases, object databases, entity-relationship model databases, associative databases, and XML databases. In some instances, the database may be internet-based. In some instances, the database may be web-based. In some instances, the database may be cloud computing based. In some instances, the database may be based on one or more local computer storage devices.
In one aspect, the present disclosure provides a non-transitory computer-readable medium comprising instructions that direct a processor to perform the methods described herein.
In one aspect, the present disclosure provides a computing device comprising a computer-readable medium.
In another aspect, the present disclosure provides a system for classifying a biological sample, comprising:
a) a receiver that receives a plurality of training samples, each of the plurality of training samples having a plurality of classes of molecules, wherein each of the plurality of training samples comprises one or more known labels;
b) a feature module that identifies, for each of the plurality of training samples, a set of features corresponding to an assay and operable to be input into a machine learning model, wherein the set of features corresponds to molecular characteristics in the plurality of training samples, wherein for each of the plurality of training samples, the system is operable to perform a plurality of different assays on the plurality of classes of molecules in the training sample to obtain sets of measurements, wherein each set of measurements is from a single assay performed on a class of molecules in the training sample, and wherein a plurality of sets of measurements are obtained for the plurality of training samples;
c) an analysis module for analyzing the sets of measurements to obtain a training vector for each training sample, wherein the training vector comprises feature values of the N feature sets of the corresponding assays, each feature value corresponding to a feature and comprising one or more measurements, and wherein the training vector is formed using at least one feature from at least two of the N feature sets corresponding to a first subset of the plurality of different assays;
d) a labeling module for operating on the training vectors using parameters of the machine learning model in order to obtain output labels for the plurality of training samples;
e) a comparator module for comparing the output labels with the known labels of the training samples;
f) a training module for iteratively searching for optimal values of the parameters, as part of training the machine learning model, based on the comparison of the output labels with the known labels of the training samples; and
g) an output module for providing the parameters of the machine learning model and the feature set of the machine learning model.
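The interplay of the labeling, comparator, and training modules can be sketched as an iterative parameter search. The example below uses logistic regression fitted by gradient descent purely as an illustrative stand-in; the model family, learning rate, epoch limit, and stopping rule are assumptions and not the disclosed implementation.

```python
import math
from typing import List, Sequence, Tuple


def _probability(vector: Sequence[float], weights: Sequence[float], bias: float) -> float:
    z = bias + sum(w * x for w, x in zip(weights, vector))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(vectors: Sequence[Sequence[float]], weights: Sequence[float],
                   bias: float) -> List[int]:
    """Labeling step: output label 1 when the model probability is at least 0.5."""
    return [1 if _probability(v, weights, bias) >= 0.5 else 0 for v in vectors]


def train(vectors: Sequence[Sequence[float]], known_labels: Sequence[int],
          learning_rate: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Training step: iteratively adjust parameters until output labels match known labels."""
    n_features = len(vectors[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        # Comparator step: stop once every output label equals its known label.
        if predict_labels(vectors, weights, bias) == list(known_labels):
            break
        grad_w, grad_b = [0.0] * n_features, 0.0
        for vector, label in zip(vectors, known_labels):
            error = _probability(vector, weights, bias) - label
            for j, x in enumerate(vector):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(vectors) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(vectors)
    return weights, bias
```

The returned weights and bias correspond to the parameters that an output module would provide along with the feature set.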
Methods for classifying subjects in a population
The disclosed methods are directed to determining genetic and/or epigenetic parameters of genomic DNA associated with a colon cell proliferative disorder by analysis of cfDNA in a subject. The methods are useful for improving the diagnosis, treatment, and monitoring of colon cell proliferative disorders, more particularly by improving the identification of, and differentiation between, stages or subclasses of the disorder and genetic susceptibility to the disorder.
In some embodiments, the method comprises analyzing the methylation status of a CpG island, CpG shore, or CpG shelf.
In some embodiments, the method comprises analyzing the methylation state, hemimethylation state, hypermethylation state, or hypomethylation state of cell-free nucleic acids in the biological sample.
In one aspect, the present disclosure provides a method for detecting a colon cell proliferative disorder, which can be applied to a cell-free sample, for example, to detect colon cell proliferative disorder DNA in cell-free circulation. The method utilizes the detection of methylation signals in a single sequencing read as the primary "positive" colon cell proliferative disorder signal.
In some embodiments, the colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas. In some embodiments, the colon cell proliferative disorder comprises colorectal cancer.
In one aspect, the present disclosure provides a method for detecting a colon cell proliferative disorder, comprising: extracting DNA from a cell-free sample obtained from a subject; converting at least a portion of the DNA for methylation sequencing; amplifying, from the converted DNA, regions that are methylated in cancer; generating sequencing reads from the amplified regions; and detecting a signal of a colon cell proliferative disorder comprising at least one, at least two, at least three, or more than three methylated regions within a cancer panel, to obtain input features that are input into a machine learning model to obtain a classifier capable of distinguishing between two groups of subjects (e.g., healthy versus cancer, disease stage, advanced adenoma versus cancer).
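The final detection step, in which a signal comprises a minimum number of methylated regions within a panel, might be sketched as below. The per-read call encoding ('M'/'U'), the region names, and every threshold (`min_cpgs`, `min_fraction`, `min_reads`, `min_regions`) are hypothetical values chosen for illustration; the disclosure does not specify them.

```python
from typing import Dict, List


def is_methylated_read(cpg_calls: str, min_cpgs: int = 3, min_fraction: float = 0.8) -> bool:
    """Treat a single read as a positive methylation signal (assumed rule).

    A read qualifies if it covers at least `min_cpgs` CpGs and at least
    `min_fraction` of them are methylated ('M' = methylated, 'U' = unmethylated).
    """
    methylated = cpg_calls.count("M")
    total = methylated + cpg_calls.count("U")
    if total < min_cpgs:
        return False
    return methylated / total >= min_fraction


def methylated_regions(reads_by_region: Dict[str, List[str]], min_reads: int = 1) -> List[str]:
    """Return panel regions supported by at least `min_reads` positive reads."""
    return sorted(
        region
        for region, reads in reads_by_region.items()
        if sum(is_methylated_read(read) for read in reads) >= min_reads
    )


def has_signal(reads_by_region: Dict[str, List[str]], min_regions: int = 3) -> bool:
    """Call a signal when at least `min_regions` panel regions are methylated."""
    return len(methylated_regions(reads_by_region)) >= min_regions
```

The list of methylated regions, or counts derived from it, could serve as input features for the machine learning model.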
The trained machine learning methods, models, and discriminative classifiers described herein are applicable to a variety of medical applications, including cancer detection, diagnosis, and therapy responsiveness. Since the model can be trained with individual metadata and analyte-derived features, the application can be customized to stratify individuals in a population and guide treatment decisions accordingly.
Diagnosis
The methods and systems provided herein can perform predictive analysis using artificial intelligence-based methods to analyze data obtained from a subject (patient) to generate a diagnostic output for a subject with cancer (e.g., colorectal cancer). For example, the application may apply a predictive algorithm to the acquired data to generate a diagnosis of a subject with cancer. The predictive algorithm may include an artificial intelligence based predictor, such as a machine learning based predictor, configured to process the acquired data to generate a diagnosis of the cancerous subject.
The machine learning predictor can be trained using a dataset, e.g., a dataset generated by methylation profiling of individual biological samples from one or more cohorts of patients with cancer using a signature panel as described herein as the input, and the known diagnosis (e.g., stage and/or tumor score) results of the subjects as the output of the machine learning predictor.
A training data set (e.g., a data set generated from methylation measurements of individual biological samples using a signature panel as described herein) can be generated from, for example, one or more sets of subjects having common traits (features) and outcomes (labels). The training data set may include a set of features and a set of labels corresponding to the features relevant to diagnosis. Features may include, for example, certain ranges or categories of cfDNA assay measurements, such as counts of cfDNA fragments in biological samples obtained from healthy and diseased subjects that overlap or fall within each of a set of bins (genomic windows) of a reference genome. For example, a set of features collected from a given subject at a given point in time may collectively serve as a diagnostic signature, which may indicate that the subject has an identified cancer at the given point in time. Features may also include labels indicating the subject's diagnostic outcome, such as for one or more cancers.
Labels may include outcomes, such as known diagnosis (e.g., stage and/or tumor score) results for the subject. Outcomes may include a trait associated with cancer in the subject. For example, a trait may indicate that a subject has one or more cancers.
A training set (e.g., a training data set) may be selected by random sampling of a data set corresponding to one or more sets of subjects (e.g., a retrospective and/or prospective patient cohort with or without one or more cancers). Alternatively, a training set (e.g., a training data set) may be selected by proportionate sampling of a data set corresponding to one or more sets of subjects (e.g., a retrospective and/or prospective patient cohort with or without one or more cancers). The training set may be balanced across the data sets corresponding to the one or more sets of subjects (e.g., patients from different clinical sites or trials). The machine learning predictor may be trained until certain predetermined accuracy or performance conditions are met, such as having a minimum desired value for a diagnostic accuracy metric. For example, the diagnostic accuracy metric may correspond to a prediction of a diagnosis, stage, or tumor score of one or more cancers of the subject.
Examples of diagnostic accuracy metrics can include sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and area under the curve (AUC) of the receiver operating characteristic (ROC) curve corresponding to the diagnostic accuracy of detecting or predicting cancer (e.g., colorectal cancer).
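The bin-count features described above can be illustrated with a short sketch. It assumes fragments are given as aligned `(chromosome, start, end)` intervals with 0-based, half-open coordinates; the function name `bin_fragment_counts` and the default `bin_size` are illustrative choices, not values from the disclosure.

```python
from collections import Counter
from typing import Iterable, Tuple

Fragment = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def bin_fragment_counts(fragments: Iterable[Fragment], bin_size: int = 1000) -> Counter:
    """Count, for each genomic bin, the fragments that overlap or fall within it.

    A fragment spanning a bin boundary is counted once in every bin it touches.
    """
    counts: Counter = Counter()
    for chrom, start, end in fragments:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size  # end is exclusive
        for bin_index in range(first_bin, last_bin + 1):
            counts[(chrom, bin_index)] += 1
    return counts
```

Counts from a healthy sample and a disease sample computed over the same bins then form directly comparable feature vectors.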
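One way to select a training set that stays balanced across clinical sites is to sample the same proportion from every (site, label) group. The sketch below assumes each sample is a dictionary with hypothetical `site` and `label` keys; the function name and the fixed seed are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(samples: List[Dict], fraction: float, seed: int = 0) -> List[Dict]:
    """Draw the same fraction of samples from every (site, label) group."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["site"], sample["label"])].append(sample)

    training = []
    for key in sorted(groups):  # sorted for a deterministic group order
        members = groups[key]
        k = max(1, round(len(members) * fraction))  # keep at least one per group
        training.extend(rng.sample(members, k))
    return training
```

Plain random sampling corresponds to calling `random.sample` on the whole list; the grouped version trades a little simplicity for protection against one site or class dominating the training set.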
In one aspect, the present disclosure provides a method of using a classifier capable of distinguishing populations of individuals, comprising: a) performing an assay of a plurality of classes of molecules in a biological sample, wherein the assay provides a plurality of sets of measurements representative of the plurality of classes of molecules; b) identifying a set of features, corresponding to characteristics of each of the plurality of classes of molecules, to be input into a machine learning or statistical model; c) preparing a feature vector of feature values from each of the plurality of sets of measurements, each feature value corresponding to a feature of the set of features and including one or more measurement values, wherein the feature vector includes at least one feature value obtained using each of the plurality of sets of measurements; d) loading into a memory of a computer system a trained machine learning model comprising a classifier, the trained machine learning model having been trained using training vectors obtained from training biological samples, a first subset of the training biological samples identified as having a specified characteristic, and a second subset of the training biological samples identified as not having the specified characteristic; and e) applying the trained machine learning model to the feature vector to obtain an output classification of whether the biological sample has the specified characteristic, thereby distinguishing a population of individuals having the specified characteristic.
In one aspect, the present disclosure provides a method of using a classifier that is capable of distinguishing between populations of individuals, comprising: a) performing an assay of a plurality of classes of molecules in a biological sample, wherein the assay provides a plurality of sets of measurements representative of the plurality of classes of molecules; b) identifying a set of features, corresponding to characteristics of each of the plurality of classes of molecules, to be input into a machine learning or statistical model; c) preparing a feature vector of feature values from each of the plurality of sets of measurements, each feature value corresponding to a feature of the set of features and including one or more measurement values, wherein the feature vector includes at least one feature value obtained using each of the plurality of sets of measurements; d) loading into a memory of a computer system a trained machine learning model comprising a classifier, the trained machine learning model having been trained using training vectors obtained from training biological samples, a first subset of the training biological samples identified as having a specified characteristic, and a second subset of the training biological samples identified as not having the specified characteristic; and e) applying the trained machine learning model to the feature vector to obtain an output classification of whether the biological sample has the specified characteristic, thereby distinguishing a population of individuals having the specified characteristic.
In one aspect, the present disclosure provides a method of using a classifier that can distinguish a population of individuals, comprising: a) detecting methylation signals within single sequencing reads of preselected genomic regions in one or more first patient samples; b) using the methylation signals to inform the classifier that produces the data output, thereby informing the machine learning model; and c) testing a second patient sample for methylation signals using the informed classifier.
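Steps c) through e) can be sketched with a deliberately small model. The example below uses a logistic regression trained by stochastic gradient descent purely as a stand-in; the disclosure does not specify the model type, and the function names, learning rate, epoch count, and decision threshold are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors: Sequence[Sequence[float]], labels: Sequence[int],
                   lr: float = 0.5, epochs: int = 500) -> Model:
    """Fit weights on training vectors; label 1 = has the specified characteristic."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            p = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log-loss with respect to the logit
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def classify(model: Model, x: Sequence[float], threshold: float = 0.5) -> Tuple[bool, float]:
    """Apply the trained model to one feature vector; return (call, probability)."""
    weights, bias = model
    p = _sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
    return p >= threshold, p
```

Here each feature vector would hold at least one value per measurement set, for example a methylation-derived value followed by a fragment-count-derived value, so that every class of molecules contributes to the classification.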
In some embodiments, the preselected genomic region is selected from two or more methylated genomic regions in tables 1-11, three or more methylated genomic regions in tables 1-11, four or more methylated genomic regions in tables 1-11, five or more methylated genomic regions in tables 1-11, six or more methylated genomic regions in tables 1-11, seven or more methylated genomic regions in tables 1-11, eight or more methylated genomic regions in tables 1-11, nine or more methylated genomic regions in tables 1-11, ten or more methylated genomic regions in tables 1-11, eleven or more methylated genomic regions in tables 1-11, twelve or more methylated genomic regions in tables 1-11, or thirteen or more methylated genomic regions in tables 1-11.
In another aspect, the present disclosure provides a method for identifying cancer in a subject, comprising: a) Providing a biological sample containing cell-free nucleic acid (cfNA) molecules from the subject; b) Performing methyl conversion and sequencing on the cfNA molecules from the subject to generate a plurality of cfNA sequencing reads; c) Aligning the plurality of cfNA sequencing reads to a reference genome; d) Generating a quantitative measure of the plurality of cfNA sequencing reads on each of a first plurality of genomic regions of the reference genome to generate a first set of cfNA features, wherein the first plurality of genomic regions of the reference genome comprises at least about 10 distinct regions, each of the at least about 10 distinct regions comprising at least a portion of a gene selected from a methylated region in a signature panel described herein; and e) applying a trained algorithm to the first cfNA feature set to generate a likelihood that the subject has the cancer.
In some examples, the at least about 10 different regions comprise at least about 20 different regions, each of the at least about 20 different regions comprising at least a portion of a methylated region identified in tables 1-11. In some examples, the at least about 10 different regions comprise at least about 30 different regions, each of the at least about 30 different regions comprising at least a portion of a methylated region identified in tables 1-11.
As another example, such a predetermined condition may be a specificity for predicting a colon cell proliferative disorder of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
As another example, such a predetermined condition may be a positive predictive value (PPV) for predicting a colon cell proliferative disorder of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
As another example, such a predetermined condition may be a negative predictive value (NPV) for predicting a colon cell proliferative disorder of, for example, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, or at least about 99%.
As another example, such a predetermined condition may be the area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve predicting a colon cell proliferative disorder comprising the following values: at least about 0.50, at least about 0.55, at least about 0.60, at least about 0.65, at least about 0.70, at least about 0.75, at least about 0.80, at least about 0.85, at least about 0.90, at least about 0.95, at least about 0.96, at least about 0.97, at least about 0.98, or at least about 0.99.
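For reference, the AUC of a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way and checks it against a minimum value; the function names and the default minimum of 0.90 are illustrative assumptions.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


def meets_auc_condition(auc: float, minimum: float = 0.90) -> bool:
    """Check a predetermined performance condition such as 'AUC of at least 0.90'."""
    return auc >= minimum
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation gives the same value more efficiently on large cohorts.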
Responsiveness to treatment
The predictive classifiers, systems, and methods described herein can be used to classify populations of individuals for use in a variety of clinical applications (e.g., based on methylation determinations of biological samples of individuals using signature panels described herein). Examples of such clinical applications include: detecting early stage cancer, diagnosing cancer, classifying cancer into specific disease stages, determining responsiveness or resistance to a therapeutic agent for treating cancer.
The methods and systems described herein are applicable to characteristics of colon cell proliferative disorders, such as grade and stage. Thus, the combination of analyte and assay can be used in the present systems and methods to predict responsiveness of cancer therapeutics of different cancer types in different tissues and classify individuals according to treatment responsiveness. In some embodiments, the classifier described herein is capable of stratifying a group of individuals into treatment responders and non-responders.
The present disclosure also provides a method for determining a drug target (e.g., a gene associated or important to a particular class) for a condition or disease of interest, comprising: assessing the gene expression level of at least one gene in a sample obtained from the individual; and determining genes associated with the classification of the sample using a neighborhood analysis program, thereby determining one or more drug targets associated with the classification.
The present disclosure also provides a method for determining the efficacy of a drug designed to treat a class of diseases, comprising: obtaining a sample from an individual having the disease category; subjecting the sample to a pharmaceutical agent; assessing the gene expression level of at least one gene in the drug-exposed sample; and classifying the drug-exposed sample into a class of diseases as a function of the relative gene expression levels of the sample relative to the model using a computer model created using a weighted voting scheme.
The present disclosure also provides a method for determining the efficacy of a drug designed to treat a disease category, wherein an individual has been affected by the drug, the method comprising obtaining a sample from an individual affected by the drug; assessing the gene expression level of at least one gene in the sample; and classifying the sample into a class of diseases using the model established using a weighted voting scheme, comprising assessing the gene expression level of the sample as compared to the gene expression level of the model.
The present disclosure also provides a method of determining whether an individual belongs to a phenotypic category (e.g., intelligence, response to treatment, length of life, likelihood of viral infection, or obesity) comprising: obtaining a sample from an individual; assessing the gene expression level of at least one gene in the sample; and classifying the sample into a class of diseases using the model established using a weighted voting scheme, comprising assessing the gene expression level of the sample as compared to the gene expression level of the model.
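The disclosure does not spell out the weighted voting scheme. As an assumption, the sketch below uses one common signal-to-noise formulation: each gene votes with weight (mean of class A minus mean of class B) divided by the sum of the class standard deviations, applied to the distance of the sample's expression from the midpoint of the two class means. All names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

VotingModel = List[Tuple[float, float]]  # per gene: (weight, midpoint)


def fit_weighted_voting(class_a: Sequence[Sequence[float]],
                        class_b: Sequence[Sequence[float]]) -> VotingModel:
    """Derive one (weight, midpoint) pair per gene from two classes of samples."""
    model: VotingModel = []
    for g in range(len(class_a[0])):
        a = [sample[g] for sample in class_a]
        b = [sample[g] for sample in class_b]
        mu_a, mu_b = mean(a), mean(b)
        spread = pstdev(a) + pstdev(b)
        weight = (mu_a - mu_b) / spread if spread else 0.0  # signal-to-noise weight
        model.append((weight, (mu_a + mu_b) / 2))
    return model


def vote(model: VotingModel, sample: Sequence[float]) -> str:
    """Sum the per-gene votes; a positive total favors class A."""
    total = sum(w * (x - mid) for (w, mid), x in zip(model, sample))
    return "A" if total > 0 else "B"
```

In this formulation a gene whose expression barely differs between the classes, or varies widely within them, contributes little to the final call.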
In one aspect, the systems and methods described herein that relate to classifying populations based on treatment responsiveness refer to cancers treated with chemotherapeutic agents of classes including, but not limited to, DNA damaging agents, therapies targeting DNA repair, inhibitors of DNA damage signaling, inhibitors of DNA damage-induced cell cycle arrest, and inhibitors of processes that indirectly lead to DNA damage. Each of these chemotherapeutic agents may be considered a "DNA damage therapeutic" as that term is used herein.
Based on the patient's analyte data, patients can be classified into high risk and low risk patient groups, such as patients with a high or low risk of clinical relapse, and the results can be used to determine a course of treatment. For example, patients identified as high risk patients may receive adjuvant chemotherapy after surgery. For patients considered to be low risk patients, adjuvant chemotherapy may be withheld after surgery. Accordingly, the present disclosure provides, in certain aspects, a method of preparing a colon cancer tumor gene expression profile indicative of risk of recurrence.
In various examples, the classifiers described herein enable stratification of a population of individuals between responders and non-responders to a treatment.
In another aspect, the methods disclosed herein are applicable to clinical applications involving cancer detection or monitoring.
In some embodiments, the methods disclosed herein can be used to determine and/or predict response to treatment.
In some embodiments, the methods disclosed herein can be used to monitor and/or predict tumor burden.
In some embodiments, the methods disclosed herein can be used to detect and/or predict post-operative residual tumors.
In some embodiments, the methods disclosed herein can be used to detect and/or predict minimal residual disease after treatment.
In some embodiments, the methods disclosed herein can be used to detect and/or predict relapse.
In one aspect, the methods disclosed herein can be used as a secondary screening.
In one aspect, the methods disclosed herein can be used as a single screening.
In one aspect, the methods disclosed herein can be used to monitor cancer progression.
In one aspect, the methods disclosed herein can be used to monitor and/or predict cancer risk.
Identification or monitoring of colorectal cancer
Colorectal cancer can be identified or monitored in a subject after processing the data set using a trained algorithm. The identification can be based at least in part on quantitative measures of sequence reads of the dataset at a panel of colorectal cancer-associated genomic loci (e.g., quantitative measures of RNA transcripts or DNA at the colorectal cancer-associated genomic loci).
Colorectal cancer can be identified in a subject with the following accuracy: at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or greater. The accuracy of colorectal cancer identification by the trained algorithm can be calculated as the percentage of independent test samples (e.g., subjects known to have colorectal cancer or subjects for which clinical test results for colorectal cancer are negative) that are correctly identified or classified as having or not having colorectal cancer.
Colorectal cancer can be identified in a subject with a Positive Predictive Value (PPV) as follows: at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The PPV for identifying colorectal cancer using a trained algorithm can be calculated as the percentage of cell-free biological samples identified or classified as having colorectal cancer that correspond to subjects who actually have colorectal cancer.
Colorectal cancer can be identified in a subject with a negative predictive value (NPV) as follows: at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more. The NPV for identifying colorectal cancer using a trained algorithm can be calculated as the percentage of cell-free biological samples identified or classified as not having colorectal cancer that correspond to subjects who actually do not have colorectal cancer.
Colorectal cancer can be identified in a subject with clinical sensitivity as follows: at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical sensitivity of identifying colorectal cancer using a trained algorithm can be calculated as the percentage of independent test samples associated with the presence of colorectal cancer (e.g., subjects known to have colorectal cancer) that are correctly identified or classified as having colorectal cancer.
Colorectal cancer can be identified in a subject with clinical specificity as follows: at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or more. The clinical specificity of identifying colorectal cancer using a trained algorithm can be calculated as the percentage of independent test samples associated with the absence of colorectal cancer (e.g., subjects with negative clinical test results for colorectal cancer) that are correctly identified or classified as not having colorectal cancer.
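The accuracy, PPV, NPV, clinical sensitivity, and clinical specificity calculations all follow from the four cells of a confusion matrix built on independent test samples. A minimal sketch, with hypothetical function and key names, is:

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    # An undefined ratio (empty denominator) is reported as NaN rather than raising.
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, float]:
    """Compute metrics from binary calls (1 = colorectal cancer, 0 = no cancer)."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)  # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)  # false positives
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)  # true negatives
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)  # false negatives
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "ppv": _ratio(tp, tp + fp),          # called positive and truly positive
        "npv": _ratio(tn, tn + fn),          # called negative and truly negative
        "sensitivity": _ratio(tp, tp + fn),  # truly positive and called positive
        "specificity": _ratio(tn, tn + fp),  # truly negative and called negative
    }
```

Multiplying each value by 100 gives the percentages used in the thresholds above.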
In some embodiments, the trained algorithm may determine that the subject is at a risk of colorectal cancer of at least about 5%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, or more.
The trained algorithm can determine that the subject is at risk for colorectal cancer with an accuracy of at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 81%, at least about 82%, at least about 83%, at least about 84%, at least about 85%, at least about 86%, at least about 87%, at least about 88%, at least about 89%, at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.1%, at least about 99.2%, at least about 99.3%, at least about 99.4%, at least about 99.5%, at least about 99.6%, at least about 99.7%, at least about 99.8%, at least about 99.9%, at least about 99.99%, at least about 99.999%, or greater.
Upon identifying a subject as having colorectal cancer, the subject may be provided with a therapeutic intervention (e.g., prescribing or administering to the subject an appropriate course of treatment to treat the colorectal cancer). The therapeutic intervention may include prescribing an effective dose of a drug, further testing or assessment of the colorectal cancer, further monitoring of the colorectal cancer, or a combination thereof. If the subject is currently being treated for colorectal cancer with one course of treatment, the therapeutic intervention may include a subsequent, different course of treatment (e.g., to increase treatment efficacy when the current course of treatment is ineffective). Therapeutic interventions may be described, for example, by "WHO list of priority medical devices for cancer management", WHO Medical device technical series, World Health Organization, ISBN: 978-92-4-156546-2, Geneva, 2017, the contents of which are incorporated herein by reference. Therapeutic interventions may also be described by, for example, Wolpin et al., "Systemic Treatment of Colorectal Cancer," Gastroenterology, volume 134, issue 5, 2008, pages 1296-1310.e1, the contents of which are incorporated herein by reference.
The therapeutic intervention may comprise recommending that the subject undergo a secondary clinical examination to confirm the diagnosis of colorectal cancer. This secondary clinical examination may include an imaging examination, a blood examination, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, a cell-free biological cytology examination, a fecal immunochemical test (FIT), a fecal occult blood test (FOBT), or any combination thereof.
Quantitative measures of sequence reads of the dataset at a panel of colorectal cancer-associated genomic loci (e.g., quantitative measures of RNA transcripts or DNA at the colorectal cancer-associated genomic loci) can be evaluated over a period of time to monitor a patient (e.g., a subject who has colorectal cancer or is being treated for colorectal cancer). In such cases, the quantitative measures of the patient's dataset may change during the course of treatment. For example, the quantitative measures of the dataset of a patient whose risk of colorectal cancer is decreasing due to effective treatment may shift toward the profile or distribution of healthy subjects (e.g., subjects without colorectal cancer). Conversely, the quantitative measures of the dataset of a patient whose risk of colorectal cancer is increasing due to ineffective treatment may shift toward the profile or distribution of subjects with a higher risk of colorectal cancer or a more advanced grade or stage of colorectal cancer.
The colorectal cancer of the subject may be monitored by monitoring a course of treatment for treating the colorectal cancer of the subject. Monitoring may comprise assessing the colorectal cancer of the subject at two or more time points. The assessment can be based at least on quantitative measures of sequence reads of the dataset at a panel of colorectal cancer-associated genomic loci (e.g., quantitative measures of RNA transcripts or DNA at the colorectal cancer-associated genomic loci), as determined at each of the two or more time points.
In some embodiments, a difference in the quantitative measures of sequence reads of the dataset at a panel of colorectal cancer-associated genomic loci (e.g., quantitative measures of RNA transcripts or DNA at the colorectal cancer-associated genomic loci) determined between two or more time points may be indicative of one or more clinical indications, such as: (i) a diagnosis of colorectal cancer in the subject; (ii) a prognosis of colorectal cancer in the subject; (iii) an increased risk of colorectal cancer in the subject; (iv) a reduced risk of colorectal cancer in the subject; (v) efficacy of a course of treatment for treating colorectal cancer in the subject; and (vi) non-efficacy of a course of treatment for treating colorectal cancer in the subject.
In some embodiments, a difference in the quantitative measures of sequence reads of the dataset at the panel of colorectal cancer-associated genomic loci (e.g., quantitative measures of RNA transcripts or DNA at the colorectal cancer-associated genomic loci) determined between two or more time points can indicate a diagnosis of colorectal cancer in the subject. For example, if colorectal cancer is not detected in the subject at an earlier time point but is detected at a later time point, the difference is indicative of a diagnosis of colorectal cancer in the subject. A clinical action or decision may be made based on this indication of a diagnosis of colorectal cancer in the subject, e.g., to prescribe or administer a new therapeutic intervention for the subject. The clinical action or decision may include recommending that the subject undergo a secondary clinical examination to confirm the diagnosis of colorectal cancer. This secondary clinical examination may include an imaging examination, a blood examination, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, a cell-free biological cytology examination, a fecal immunochemical test (FIT), a fecal occult blood test (FOBT), or any combination thereof.
In some embodiments, a difference in a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcripts or DNA at the colorectal cancer-associated genomic locus) on the panel of colorectal cancer-associated genomic loci, including a difference in a quantitative measure of the panel of colorectal cancer-associated genomic loci determined between two or more time points, can indicate a prognosis of the subject's colorectal cancer.
In some embodiments, a difference in a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcripts or DNA at the colorectal cancer-associated genomic locus) on the panel of colorectal cancer-associated genomic loci, including a difference in a quantitative measure of the panel of colorectal cancer-associated genomic loci determined between two or more time points, may indicate an increased risk of the subject to develop colorectal cancer. For example, if the subject detects colorectal cancer at both an earlier time point and a later time point, and if the difference is positive (e.g., a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcript or DNA at a colorectal cancer-associated genomic locus) is increasing from the earlier time point to the later time point) the difference may indicate that the subject is at increased risk of developing colorectal cancer. A clinical action or decision may be made based on this indication of an increased risk of colorectal cancer, e.g., prescribing or administering a new therapeutic intervention or switching therapeutic intervention to the subject (e.g., ending the current therapy and prescribing or administering a new therapy). The clinical action or decision may include suggesting that the subject undergo a secondary clinical examination to confirm an increased risk of developing colorectal cancer. This secondary clinical examination may include an imaging examination, a blood examination, a Computed Tomography (CT) scan, a Magnetic Resonance Imaging (MRI) scan, an ultrasound scan, chest X-rays, a Positron Emission Tomography (PET) scan, a PET-CT scan, a cell-free biocytology examination, a stool immunochemistry examination (FIT), a stool occult blood examination (FOBT), or any combination thereof.
In some embodiments, a difference in a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcripts or DNA at the colorectal cancer-associated genomic locus) on the panel of colorectal cancer-associated genomic loci, including a difference in a quantitative measure of the panel of colorectal cancer-associated genomic loci determined between two or more time points, may indicate a reduced risk of the subject developing colorectal cancer. For example, if colorectal cancer is detected in the subject at both an earlier time point and a later time point, and if the difference is negative (e.g., the quantitative measure of sequence reads of the dataset on the panel of colorectal cancer-associated genomic loci, such as a quantitative measure of RNA transcripts or DNA on the panel, decreases from the earlier time point to the later time point), the difference may indicate that the subject is at reduced risk of developing colorectal cancer. A clinical action or decision may be made for the subject (e.g., to continue or end the current therapeutic intervention) based on this indication of reduced risk of colorectal cancer. The clinical action or decision may include suggesting that the subject undergo a secondary clinical examination to confirm a reduced risk of colorectal cancer. This secondary clinical examination may include a radiographic examination, a blood examination, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, a cell-free biological cytology examination, a fecal immunochemical test (FIT), a fecal occult blood test (FOBT), or any combination thereof.
In some embodiments, a difference in a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcripts or DNA at the colorectal cancer-associated genomic locus) on the panel of colorectal cancer-associated genomic loci, including a difference in a quantitative measure of the panel of colorectal cancer-associated genomic loci determined between two or more time points, can indicate the efficacy of a course of treatment for colorectal cancer in a subject. For example, if colorectal cancer is detected in the subject at an earlier time point but not at a later time point, the difference may indicate the efficacy of the course of treatment for the subject's colorectal cancer. A clinical action or decision may be made based on this indication of the efficacy of the course of treatment, e.g., to continue or end the current therapeutic intervention for the subject. The clinical action or decision may include suggesting that the subject undergo a secondary clinical examination to confirm the efficacy of the course of treatment for the subject's colorectal cancer. This secondary clinical examination may include an imaging examination, a blood examination, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, a cell-free biological cytology examination, a fecal immunochemical test (FIT), a fecal occult blood test (FOBT), or any combination thereof.
In some embodiments, a difference in a quantitative measure of sequence reads of the dataset (e.g., a quantitative measure of RNA transcripts or DNA at the colorectal cancer-associated genomic locus) on the panel of colorectal cancer-associated genomic loci, including a difference in a quantitative measure of the panel of colorectal cancer-associated genomic loci determined between two or more time points, can indicate that the course of treatment for colorectal cancer in the subject is ineffective. For example, if colorectal cancer is detected in the subject at both an earlier time point and a later time point, and if the difference is positive or zero (e.g., the quantitative measure of sequence reads of the dataset on the panel of colorectal cancer-associated genomic loci, such as a quantitative measure of RNA transcripts or DNA at a colorectal cancer-associated genomic locus, increases or remains constant from the earlier time point to the later time point), the difference may indicate that the course of treatment for the subject's colorectal cancer is ineffective. A clinical action or decision may be made based on this indication, e.g., to end the subject's current therapeutic intervention and/or to switch to (e.g., prescribe or administer) a new, different therapeutic intervention. The clinical action or decision may include advising the subject to undergo a secondary clinical examination to confirm the ineffectiveness of the course of treatment for the subject's colorectal cancer. This secondary clinical examination may include an imaging examination, a blood examination, a computed tomography (CT) scan, a magnetic resonance imaging (MRI) scan, an ultrasound scan, a chest X-ray, a positron emission tomography (PET) scan, a PET-CT scan, a cell-free biological cytology examination, a fecal immunochemical test (FIT), a fecal occult blood test (FOBT), or any combination thereof.
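The time-point comparisons described in the preceding paragraphs can be illustrated with a minimal Python sketch. The names `TimePoint` and `indications` are hypothetical and introduced only for illustration, and the quantitative measure is assumed to be a single number per time point; the sketch simply maps the sign of the difference, together with detection status, to the indications named in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TimePoint:
    """One assessment of a subject (hypothetical structure for illustration)."""
    detected: bool   # whether colorectal cancer was detected at this time point
    measure: float   # quantitative measure over the panel of genomic loci


def indications(earlier: TimePoint, later: TimePoint) -> List[str]:
    """Return the indications suggested by the change between two time points."""
    diff = later.measure - earlier.measure
    if earlier.detected and not later.detected:
        # Detected earlier but not later: the course of treatment may be effective.
        return ["treatment effective"]
    if earlier.detected and later.detected:
        if diff > 0:
            # Positive difference: increased risk, and treatment may be ineffective.
            return ["increased risk", "treatment ineffective"]
        if diff == 0:
            # Zero difference: treatment may be ineffective.
            return ["treatment ineffective"]
        # Negative difference: reduced risk.
        return ["reduced risk"]
    # Other combinations are not covered by the examples in the text.
    return []
```

Any such indication would still be confirmed by a secondary clinical examination, as the text describes; the sketch encodes only the example cases given above.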
VIII. Kit
The present disclosure provides kits for identifying or monitoring cancer in a subject. The kit can include probes for identifying quantitative measures of sequence (e.g., indicative of presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in a cell-free biological sample of a subject. A quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in a cell-free biological sample can be indicative of one or more cancers. The probe may be selective for sequences at a plurality of cancer-associated genomic loci in the cell-free biological sample. The kit can include instructions for processing the cell-free biological sample using the probe to generate a data set indicative of a quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in the cell-free biological sample of the subject.
The probes in the kit can be selective for sequences at a plurality of cancer-associated genomic loci in a cell-free biological sample. The probes in the kit can be configured to selectively enrich for nucleic acid (e.g., RNA or DNA) molecules corresponding to a plurality of cancer-associated genomic loci. The probes in the kit may be nucleic acid primers. The probes in the kit can have sequence complementarity with nucleic acid sequences from one or more of a plurality of cancer-associated genomic loci or genomic regions. The plurality of cancer-associated genomic loci or genomic regions can comprise at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, or more different cancer-associated genomic loci or genomic regions. The plurality of cancer-associated genomic loci or genomic regions can comprise one or more members selected from the regions listed in tables 1-11.
The instructions in the kit can include instructions for assaying the cell-free biological sample using a probe that is selective for sequences at a plurality of cancer-associated genomic loci in the cell-free biological sample. These probes may be nucleic acid molecules (e.g., RNA or DNA) having sequence complementarity with a nucleic acid sequence (e.g., RNA or DNA) from one or more of a plurality of cancer-associated genomic loci. These nucleic acid molecules may be primers or enrichment sequences. Instructions for assaying the cell-free biological sample can include instructions for performing array hybridization, polymerase chain reaction (PCR), or nucleic acid sequencing (e.g., DNA sequencing or RNA sequencing) to process the cell-free biological sample, thereby generating a dataset indicative of a quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in the cell-free biological sample. A quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of a plurality of cancer-associated genomic loci in a cell-free biological sample can be indicative of one or more cancers.
The instructions in the kit can include instructions to measure and interpret an assay readout that can be quantified at one or more of the plurality of cancer-associated genomic loci to generate a dataset indicative of a quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of the plurality of cancer-associated genomic loci in the cell-free biological sample. For example, quantification of array hybridization or polymerase chain reaction (PCR) corresponding to a plurality of cancer-associated genomic loci can generate a dataset indicative of a quantitative measure of sequence (e.g., indicative of presence, absence, or relative quantity) at each of the plurality of cancer-associated genomic loci in the cell-free biological sample. Assay readouts can include quantitative PCR (qPCR) values, digital PCR (dPCR) values, droplet digital PCR (ddPCR) values, fluorescence values, and the like, or normalized values thereof.
Examples
Example 1: selection of methylated regions for colorectal cancer detection
For colorectal cancer, 20 genomic regions were identified using the systems and methods of the present disclosure that are highly methylated in tumor tissue but show no methylation in multiple normal tissues. These regions serve as highly specific markers for the presence of a tumor, with little or no background signal.
In table 12, 'position start-position end' specifies the coordinates of the target region in the hg18 build of the human genome reference sequence. The gene ID and chromosome fields refer to the gene and chromosome number associated with the numbered region. Examination of these sequences relative to adjacent genes indicates that they are found in upstream, 5' promoter, 5' enhancer, intron, exon, distal promoter, coding, or intergenic regions, respectively.
Cell-free DNA (spiked with unique synthetic double-stranded DNA (dsDNA) fragments for sample tracking) was extracted from 250 microliters (μL) of plasma using the MagMAX Cell-Free DNA Isolation Kit (Applied Biosystems) according to the manufacturer's instructions. Paired-end sequencing libraries were prepared using the NEBNext Ultra II DNA Library Prep Kit (New England Biolabs), including polymerase chain reaction (PCR) amplification and unique molecular identifiers (UMIs), and were sequenced at 2x51 base pairs on an Illumina NovaSeq 6000 sequencing system across multiple S2 or S4 flow cells to a minimum of 400 million reads (median = 636 million reads).
Probe for colorectal cancer
PCR primer pairs were designed against different regions of the genome that show extensive methylation in multiple colorectal cancer samples from the TCGA database, but little or no methylation in multiple normal tissues and blood cells (peripheral blood mononuclear cells and others).
These primers were then used to amplify converted DNA from plasma samples of individuals at risk of having colorectal cancer. Sequencing adapters were ligated to the DNA, and next-generation sequencing was performed. Sequencing reads were then separated by region and analyzed using tools such as the BiQ Analyzer HT program.
The resulting sequencing reads were demultiplexed, adapter-trimmed, and aligned to the human reference genome (GRCh38 with decoy, alt, and HLA contigs) using the Burrows-Wheeler Aligner (BWA-MEM 0.7.15). PCR duplicate fragments, if present, were removed using fragment endpoints and/or UMIs.
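Duplicate removal by fragment endpoints and UMI can be sketched as follows. This is a minimal illustration and not the pipeline actually used: the `Fragment` tuple and `remove_pcr_duplicates` function are hypothetical names, and fragments are assumed to have already been aligned and reduced to a chromosome, start, end, and optional UMI.

```python
from typing import Iterable, List, NamedTuple, Optional


class Fragment(NamedTuple):
    """An aligned cfDNA fragment (hypothetical structure for illustration)."""
    chrom: str
    start: int
    end: int
    umi: Optional[str] = None  # None when no UMI is available


def remove_pcr_duplicates(fragments: Iterable[Fragment]) -> List[Fragment]:
    """Keep the first fragment seen for each (chrom, start, end, UMI) key.

    Fragments sharing both endpoints and the same UMI are treated as PCR
    duplicates. When UMIs are absent (None), endpoints alone decide.
    """
    seen = set()
    unique = []
    for frag in fragments:
        key = (frag.chrom, frag.start, frag.end, frag.umi)
        if key not in seen:
            seen.add(key)
            unique.append(frag)
    return unique
```

Production tools typically also tolerate sequencing errors within the UMI; that refinement is omitted here for brevity.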
cfDNA "profiles" were created for each sample by counting the number of fragments aligned to each putative protein-coding region in the genome. This type of data reflects epigenetic changes: variable nucleosome protection of cfDNA alters which fragments are observed, producing increased coverage and methylation relative to controls.
A set of functional regions of the human genome, including putative protein-coding gene regions (genomic coordinate ranges including introns and exons), was annotated in the sequencing data. Annotations of the protein-coding gene regions ("gene" regions) were obtained from the Comprehensive Human Expressed Sequences (CHESS) project (v1.0).
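The per-region fragment counting behind a cfDNA profile can be sketched as below. The function name and data layout are assumptions made for illustration: regions are given as a name mapped to a (chromosome, start, end) tuple, fragments as (chromosome, start, end) tuples, and all coordinates are assumed to be half-open intervals.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open


def count_fragments_per_region(
    fragments: Iterable[Interval],
    regions: Dict[str, Interval],
) -> Dict[str, int]:
    """Count fragments that overlap each annotated region by at least one base."""
    counts = {name: 0 for name in regions}
    for chrom, start, end in fragments:
        for name, (r_chrom, r_start, r_end) in regions.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start < r_end and end > r_start:
                counts[name] += 1
    return counts
```

The nested loop keeps the sketch short; for genome-scale annotation sets an interval tree or a sorted sweep would be the usual choice.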
The results obtained are as follows.
Table 12 provides a collection of hypermethylated genomic regions in a cell-free nucleic acid sample identified from a sample of an individual having colorectal cancer. For each region, an exemplary number of methylated CpG sites in the region is provided as a threshold for distinguishing healthy individuals from CRC individuals.
TABLE 12
Figure BDA0003971977860000811
Figure BDA0003971977860000821
In the discussion herein, references to genes such as ITGA4, TMEM163, and SFMBT2, for example, may not indicate the gene of interest itself, but rather the relevant methylation regions described in the signature panels.
A total of 50 regions were found to be hypermethylated in association with CRC. Not all regions need to be included in the classification model in order to distinguish healthy individuals from CRC individuals. Thus, some regions appear to be generally indicative of various types of cancer being assessed. Other regions are methylated in these subgroups, while the remainder are specific for cancer. In the context of this determination and the type of cancer examined, certain regions may be described as "specifically methylated in colorectal cancer" and having higher weight in the signature when training the sample sequence in the predictive model. These more heavily weighted methylation regions associated with CRC are used in a particular model that is trained to distinguish between a population of healthy individuals and a population of CRC individuals.
Example 2: construction and training of classification models for differentiating populations of colorectal cancer individuals
Using the systems and methods of the present disclosure, a machine learning classification model was constructed and trained using artificial intelligence-based methods to analyze cfDNA data acquired from a subject and to generate a diagnostic output for a subject with colorectal cancer.
Expected human plasma samples were obtained from 49 patients diagnosed with CRC. In addition, a set of 92 control samples were obtained from patients who currently had no diagnosis of cancer (but may have other co-morbid or undiagnosed cancer). All samples were de-identified.
The age, sex, and cancer stage (where available) of each patient were recorded for each sample. Plasma samples collected from each patient were stored at -80 °C and thawed prior to use. Table 13 provides a description of the study cohort, showing the number of healthy and cancer samples (broken down by stage, sex, and age) used for the CRC experiment.
TABLE 13
Figure BDA0003971977860000831
Figure BDA0003971977860000841
The samples were processed and sequenced according to the methods described herein, in particular the method described in example 1. Only the methylated regions in table 12 were used to determine methylated CpG status in healthy individuals and individuals with colorectal cancer. For each region listed in column 1 of table 12, the threshold number of CpG sites shown in column 2 was used to define methylated fragments for analysis. The remaining fragments were classified as methylated if their number of methylated CpG sites was above the threshold; otherwise, they were classified as unmethylated. The raw score for each sample is the number of methylated fragments in that sample that overlap the regions listed in table 12, with the per-region counts summed across regions. The raw scores were normalized to account for differences in coverage between samples: the raw score for each sample was multiplied by a sample-specific scaling factor given by the total number of samples divided by the pre-specified target coverage level. These normalized and scaled methylation ratios were output as the score for each sample. A threshold score was selected from the training set according to the desired specificity target. Samples were classified as positive or negative based on whether their scores exceeded this threshold. An ROC curve was generated by ranking the samples by score or, equivalently, by varying the threshold.
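The scoring procedure can be illustrated with a short Python sketch. Several details are assumptions rather than statements of the disclosed method: the function names are hypothetical, a fragment counts as methylated only when its methylated CpG count is strictly above the region threshold, and the sample-specific scaling factor is taken to be the target coverage divided by the sample's total fragment count, because the wording of the scaling factor in the text is ambiguous.

```python
from typing import Dict, List


def raw_score(
    methylated_cpgs_per_fragment: Dict[str, List[int]],
    region_thresholds: Dict[str, int],
) -> int:
    """Count methylated fragments over all panel regions.

    methylated_cpgs_per_fragment maps a region name to the number of
    methylated CpG sites observed on each fragment overlapping that region.
    """
    total = 0
    for region, cpg_counts in methylated_cpgs_per_fragment.items():
        threshold = region_thresholds[region]
        # Assumption: "above the threshold" is read as strictly greater than.
        total += sum(1 for n in cpg_counts if n > threshold)
    return total


def normalized_score(raw: int, total_fragments: int, target_coverage: int) -> float:
    """Scale the raw score to a common coverage level.

    Assumption: scaling factor = target_coverage / total_fragments.
    """
    return raw * (target_coverage / total_fragments)


def classify(score: float, cutoff: float) -> bool:
    """Return True (positive) when the score exceeds the cutoff."""
    return score > cutoff
```

With these definitions, two samples sequenced to different depths become comparable before the cutoff is applied, which is the stated purpose of the normalization step.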
The machine learning classification model was trained as described above, and parameters were selected on an independent held-out sample set. The machine learning classification model was applied to the samples described in table 13. The healthy sample with the highest proportion of hypermethylated fragment counts was selected as the cutoff for classifying new samples as positive or negative. The area under the ROC curve (AUC) was calculated on the training set using the ranks derived from normalized hypermethylated fragment counts. Sensitivity and specificity were calculated using the selected cutoff value. Confidence intervals for sensitivity and specificity were calculated using the Clopper-Pearson method, and confidence intervals for AUC were calculated using the method described by Fay, M. and Malinovsky, Y., Statistics in Medicine 37(27): 3991-4006 (2018), the contents of which are incorporated herein by reference.
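As an illustration of the sensitivity and specificity statistics, the following sketch computes both at a given cutoff and an exact Clopper-Pearson interval for a binomial proportion. It uses only the Python standard library and finds the interval bounds by bisection on the binomial distribution; the function names are hypothetical, and a statistics library would normally be used instead.

```python
from math import comb
from typing import List, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""

    def solve(func, target: float, increasing: bool) -> float:
        # Bisection on [0, 1] for a monotone function of p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if (func(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k) = alpha / 2 (increasing in p).
    lower = 0.0 if k == 0 else solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: P(X <= k) = alpha / 2 (decreasing in p).
    upper = 1.0 if k == n else solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper


def sensitivity_specificity(
    healthy: List[float], cancer: List[float], cutoff: float
) -> Tuple[float, float]:
    """Samples scoring above the cutoff are called positive."""
    true_pos = sum(1 for s in cancer if s > cutoff)
    true_neg = sum(1 for s in healthy if s <= cutoff)
    return true_pos / len(cancer), true_neg / len(healthy)
```

For example, `clopper_pearson(true_pos, len(cancer))` gives the interval for sensitivity, and the same call with the true-negative count and the number of healthy samples gives the interval for specificity.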
The average area under the curve (AUC) for this method was 0.9488 (0.87-0.98), and the average sensitivity for IU samples was 70% (0.49-0.87) at 92% specificity (0.86-0.96) (FIG. 2).
Example 3: detection and individual classification of cell-free samples
Using the systems and methods of the present disclosure, predictive analysis is performed using artificial intelligence-based methods to analyze cfDNA data obtained from a subject to generate a diagnostic output for the subject having colorectal cancer.
Provided herein is a method of predicting an increased risk of having or developing cancer in asymptomatic patients, wherein a model trained on the signature panel from the process provided in example 1 is applied to the measured biomarker panel, and the clinical factors of age and gender are used to identify those patients who have, or are at increased risk of developing, colorectal cancer. In embodiments, this method and the inventive classifier model use input variables for biomarkers measured in a normal clinical range, wherein the colorectal cancer classifier model uses the age input variable and measurements from a panel of biomarkers from the patient to classify the patient into an increased-risk category when the output of the first classifier model is above a threshold calculated based on the number of methylated CpG sites within the region.
Genes were selected according to example 1 with the aim of selecting marker genes and CpG sites with strong differential methylation (beta difference, i.e. difference between methylation specific probe and methylation non-specific probe and p-value), predictive power (AUC) and effect on gene expression (p-value from gene expression).
This selection results in a signature panel as provided herein, which contains methylated regions that can distinguish between healthy samples and CRC samples. The first subset of regions comprises 20 regions with at least 4 to 18 CpG sites with increased methylation that map to 18 genes (many genes are represented by many CpG sites).
The cfDNA CpG count profile presentation of the input cfDNA can serve as an unbiased presentation of available methylation signals in the blood, allowing capture of signals directly from tumors as well as those from non-tumor sources such as the circulating immune system or the tumor microenvironment.
Unsupervised clustering based on these genes showed a clear methylation pattern associated with either the healthy or CRC phenotype.
To assess the accuracy of the methylated regions for early detection of CRC, receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC) were calculated for the regions in the signature panel. The ROC results are shown in figs. 3A-3F and show the ability of these differentially methylated regions (DMRs) to detect CRC and to identify early-stage cancers, for patients with stage 1 (fig. 3A), stage 2 (fig. 3B), stage 3 (fig. 3C), stage 4 (fig. 3D), missing stage (fig. 3E), and all samples (fig. 3F). A total of 80 gene regions associated with increased methylation were identified. Methylated regions whose average methylation levels increase progressively relative to controls can be used to distinguish early from late stages of CRC. For example, the methylated regions of table 12 have high CRC detectability [CRC versus control AUC = 0.924 (95% CI: 0.752 to 0.954)].
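A rank-based AUC of the kind reported here can be computed directly from per-sample scores. The sketch below is a generic illustration, not the analysis code used for the figures: it uses the Mann-Whitney formulation, in which the AUC equals the probability that a randomly chosen cancer sample scores higher than a randomly chosen control, with ties counted as one half. The function name is hypothetical.

```python
from typing import Sequence


def auc_from_scores(control: Sequence[float], cancer: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for c in cancer:
        for h in control:
            if c > h:
                wins += 1.0      # cancer sample ranked above control
            elif c == h:
                wins += 0.5      # ties contribute one half
    return wins / (len(cancer) * len(control))
```

Restricting `cancer` to the samples of a single stage gives a per-stage AUC, analogous to the per-stage panels described for figs. 3A-3D.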
As summarized in table 14, the results indicate that early cancer detection from blood (e.g., in a collection of 13 stage I and II samples) has superior performance.
TABLE 14
Figure BDA0003971977860000861
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. The present invention is not intended to be limited to the specific embodiments provided within this specification. While the present invention has been described with reference to the foregoing detailed description, the descriptions and illustrations of the embodiments herein are not intended to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Further, it is to be understood that all aspects of the present invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the present invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims (49)

1. A methylation signature panel specific for a colon cell proliferative disorder, comprising:
one or more methylated genomic regions selected from table 11, wherein the one or more regions are more methylated in a biological sample from an individual having a colonic cell proliferative disorder or a subtype of a colonic cell proliferative disorder and are less methylated in normal tissue and normal blood cells of an individual not having a colonic cell proliferative disorder.
2. The methylation signature panel of claim 1, wherein the biological sample is nucleic acid, DNA, RNA, or cell-free nucleic acid (cfDNA or cfRNA).
3. The methylation signature panel of claim 1, wherein the signature panel comprises increased methylation in two or more genomic regions selected from table 11.
4. The methylation signature panel of claim 1, wherein said colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas.
5. The methylation signature panel of claim 1, wherein the colon cell proliferative disorder is selected from the group consisting of stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, and stage 4 colorectal cancer.
6. The methylation signature panel of claim 1, wherein the signature panel comprises two or more methylation genomic regions in tables 1-11, three or more methylation genomic regions in tables 1-11, four or more methylation genomic regions in tables 1-11, five or more methylation genomic regions in tables 1-11, six or more methylation genomic regions in tables 1-11, seven or more methylation genomic regions in tables 1-11, eight or more methylation genomic regions in tables 1-11, nine or more methylation genomic regions in tables 1-11, ten or more methylation genomic regions in tables 1-11, eleven or more methylation genomic regions in tables 1-11, twelve or more methylation genomic regions in tables 1-11, or thirteen or more methylation genomic regions in tables 1-11.
7. The methylation signature panel of claim 1, wherein the signature panel comprises genomic regions that are methylated in colorectal cancer, including methylated regions in one or more genomic regions selected from the group consisting of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB and FLI1.
8. The methylation signature panel of claim 1, wherein the region that is methylated in colorectal cancer comprises a methylated region selected from the group consisting of: IKZF1, KCNQ5 and ELMO1 genomic regions.
9. The methylation signature panel of claim 1, wherein the region that is methylated in colorectal cancer comprises a methylated region in one or more genomic regions selected from the group consisting of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B and ST3GAL1.
10. The methylation signature panel of claim 1, wherein the signature panel comprises methylated genomic regions selected from table 1, table 2, table 3, table 4, table 5, table 6, table 7, table 8, table 9, table 10 and table 11.
11. A methylation signature panel specific for a colon cell proliferative disorder, comprising:
two or more methylated genomic regions selected from tables 1-11, wherein the two or more regions are more methylated in a biological sample from an individual having a colonic cell proliferative disorder or a subtype of a colonic cell proliferative disorder and are less methylated in normal tissue and normal blood cells of an individual not having a colonic cell proliferative disorder.
12. The methylation signature panel of claim 11, wherein the biological sample is a nucleic acid, DNA, RNA, or cell-free nucleic acid.
13. The methylation signature panel of claim 11, wherein the signature panel comprises an increase in methylation in 6 or more genomic regions selected from tables 1-11.
14. The methylation signature panel of claim 11, wherein the colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas.
15. The methylation signature panel of claim 11, wherein the colon cell proliferative disorder is selected from the group consisting of stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, and stage 4 colorectal cancer.
16. The methylation signature panel of claim 11, wherein the signature panel comprises three or more methylation genomic regions in tables 1-11, four or more methylation genomic regions in tables 1-11, five or more methylation genomic regions in tables 1-11, six or more methylation genomic regions in tables 1-11, seven or more methylation genomic regions in tables 1-11, eight or more methylation genomic regions in tables 1-11, nine or more methylation genomic regions in tables 1-11, ten or more methylation genomic regions in tables 1-11, eleven or more methylation genomic regions in tables 1-11, twelve or more methylation genomic regions in tables 1-11, or thirteen or more methylation genomic regions in tables 1-11.
17. The methylation signature panel of claim 11, wherein the signature panel comprises genomic regions that are methylated in colorectal cancer, including methylated regions in one or more genomic regions selected from the group consisting of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB and FLI1.
18. The methylation signature panel of claim 11, wherein the region that is methylated in colorectal cancer comprises a methylated region selected from the group consisting of: IKZF1, KCNQ5 and ELMO1 genomic regions.
19. The methylation signature panel of claim 11, wherein said regions that are methylated in colorectal cancer comprise methylated regions in one or more genomic regions selected from the group consisting of: IKZF1, KCNQ5, ELMO1, CHST2, PRKCB, FLI1, CLIP4, ELOVL5, FAM72B and ST3GAL1.
20. The methylation signature panel of claim 11, wherein the signature panel comprises methylated genomic regions selected from table 1, table 2, table 3, table 4, table 5, table 6, table 7, table 8, table 9, table 10 and table 11.
21. A machine-learned classifier capable of distinguishing a population of healthy individuals from a population of individuals with a colonic cell proliferative disorder, comprising:
a) A set of measurements representative of the differentially methylated genomic region of claim 1, wherein the measurements are obtained from methylation sequencing data from a healthy subject and a subject having a colon cell proliferative disorder;
b) Wherein the measurements are used to generate a set of features corresponding to characteristics of the differentially methylated genomic region, and wherein the features are input to a machine learning or statistical model;
c) Wherein the model provides feature vectors that serve as a classifier capable of distinguishing a population of healthy individuals from a population of individuals with a colonic cell proliferative disorder.
22. The classifier of claim 21, wherein the set of measurements describes characteristics of a methylated region selected from the group consisting of: base-by-base percent methylation of CpG, CHG, and CHH; counts or ratios of fragments with different counts or ratios of methylated CpGs observed in a region; conversion efficiency (100 minus the average percent methylation of CHH); hypomethylated segments; methylation level (global average methylation of CpG, CHH, and CHG); fragment length; mid-point of a fragment; number of methylated CpGs per fragment; fraction of methylated CpGs per fragment relative to total CpGs; fraction of methylated CpGs per region relative to total CpGs; fraction of methylated CpGs in the panel relative to total CpGs; dinucleotide coverage (normalized dinucleotide coverage); uniformity of coverage (unique CpG sites at 1x and 10x average genome coverage (for S4 runs)); global average CpG coverage (depth); and average coverage at CpG islands (CGIs), CGI shelves, and CGI shores.
23. A system for detecting a colon cell proliferative disorder comprising a machine learning model classifier, comprising:
a) A computer readable medium comprising a classifier operable to classify a subject as having or not having a colon cell proliferative disorder according to a methylation signature panel; and
b) One or more processors configured to execute instructions stored on the computer-readable medium.
24. The system of claim 23, comprising the classifier of claim 21 loaded into memory of a computer system, a machine learning model trained using training vectors obtained from training biological samples, a first subset of the training biological samples identified as having a colonic cell proliferative disorder, and a second subset of the training biological samples identified as not having a colonic cell proliferative disorder.
25. A method for determining a methylation profile of a cell-free deoxyribonucleic acid (cfDNA) sample from an individual, comprising:
a) Providing conditions capable of converting unmethylated cytosines to uracil in nucleic acid molecules of the cfDNA sample to produce a plurality of converted nucleic acids;
b) Contacting the plurality of converted nucleic acids with a nucleic acid probe complementary to a pre-identified methylation signature panel selected from at least two differentially methylated regions of Tables 1-11 to enrich for sequences corresponding to the signature panel;
c) Determining the nucleic acid sequences of the plurality of converted nucleic acid molecules; and
d) Aligning the nucleic acid sequences of the plurality of converted nucleic acid molecules with a reference nucleic acid sequence, thereby determining the methylation profile of the individual.
26. The method of claim 25, further comprising amplifying the plurality of converted nucleic acids.
27. The method of claim 26, wherein the amplification comprises Polymerase Chain Reaction (PCR).
28. The method of claim 25, further comprising determining the nucleic acid sequences of the converted nucleic acid molecules at a depth of greater than 1000x, greater than 2000x, greater than 3000x, greater than 4000x, or greater than 5000x.
29. The method of claim 25, wherein the reference nucleic acid sequence is at least a portion of a human reference genome.
30. The method of claim 29, wherein the human reference genome is hg18.
31. The method of claim 25, wherein the pre-identified methylation signature panel comprises three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions in Tables 1-11.
32. The method of claim 31, wherein the pre-identified methylation signature panel comprises one or more methylated genomic regions from Table 11, two or more methylated genomic regions from Table 11, or three methylated genomic regions from Table 11.
33. The method of claim 25, wherein the methylation profile is indicative of the presence or absence of a colon cell proliferative disorder in the individual.
34. The method of claim 33, wherein the colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas.
35. The method of claim 33, wherein the colon cell proliferative disorder is selected from stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, or stage 4 colorectal cancer.
36. A method for detecting the presence or absence of a proliferative disorder of colon cells in a subject, comprising:
a) Providing conditions capable of converting unmethylated cytosines to uracil in nucleic acid molecules of a biological sample obtained or derived from the subject to produce a plurality of converted nucleic acids;
b) Contacting the plurality of converted nucleic acids with a nucleic acid probe complementary to a pre-identified methylation signature panel selected from at least two differentially methylated regions of Tables 1-11 to enrich for sequences corresponding to the signature panel;
c) Determining the nucleic acid sequences of the plurality of converted nucleic acid molecules;
d) Aligning the nucleic acid sequences of the plurality of converted nucleic acid molecules with a reference nucleic acid sequence, thereby determining a methylation profile of the subject; and
e) Applying a trained machine learning classifier to the methylation profile, wherein the trained machine learning classifier is trained to distinguish between a healthy individual and an individual having a colon cell proliferative disorder and provides an output value related to the presence of a colon cell proliferative disorder, thereby detecting the presence or absence of the colon cell proliferative disorder in the subject.
37. The method of claim 36, wherein the biological sample obtained from the subject is selected from the group consisting of: cell-free DNA, cell-free RNA, bodily fluids, stool, colonic discharge, urine, plasma, serum, whole blood, isolated blood cells, cells isolated from blood, and combinations thereof.
38. The method of claim 36, further comprising amplifying the plurality of converted nucleic acids.
39. The method of claim 38, wherein the amplification comprises Polymerase Chain Reaction (PCR).
40. The method of claim 36, further comprising determining the nucleic acid sequences of the converted nucleic acid molecules at a depth of greater than 1000x, greater than 2000x, greater than 3000x, greater than 4000x, or greater than 5000x.
41. The method of claim 36, wherein the reference nucleic acid sequence is at least a portion of a human reference genome.
42. The method of claim 41, wherein the human reference genome is hg18.
43. The method of claim 36, wherein the pre-identified methylation signature panel comprises three or more, four or more, five or more, six or more, seven or more, eight or more, nine or more, ten or more, eleven or more, twelve or more, or thirteen or more methylated genomic regions in Tables 1-11.
44. The method of claim 43, wherein the pre-identified methylation signature panel comprises one or more methylated genomic regions of Table 11, two or more methylated genomic regions of Table 11, or three methylated genomic regions of Table 11.
45. The method of claim 36, further comprising administering to the subject a treatment for the colon cell proliferative disorder based on detecting the presence of the colon cell proliferative disorder in the subject.
46. The method of claim 36, wherein the colon cell proliferative disorder is selected from the group consisting of: adenomas (adenomatous polyps), sessile serrated adenomas (SSA), advanced adenomas, colorectal dysplasias, colorectal adenomas, colorectal cancer, colon cancer, rectal cancer, colorectal epithelial carcinoma, colorectal adenocarcinoma, carcinoid tumors, gastrointestinal stromal tumors (GIST), lymphomas, and sarcomas.
47. The method of claim 36, wherein the colon cell proliferative disorder comprises colorectal cancer.
48. The method of claim 36, wherein the colon cell proliferative disorder is selected from stage 1 colorectal cancer, stage 2 colorectal cancer, stage 3 colorectal cancer, and stage 4 colorectal cancer.
49. The method of claim 36, wherein the trained machine learning classifier is selected from the group consisting of: deep learning classifiers, neural network classifiers, linear discriminant analysis (LDA) classifiers, quadratic discriminant analysis (QDA) classifiers, support vector machine (SVM) classifiers, random forest (RF) classifiers, linear kernel support vector machine classifiers, first-order or second-order polynomial kernel support vector machine classifiers, ridge regression classifiers, elastic net algorithm classifiers, sequential minimal optimization algorithm classifiers, naive Bayes algorithm classifiers, and principal component analysis classifiers.
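For illustration only, the data flow recited in claims 22, 25, and 36 can be sketched as follows: methylation states are read from converted, aligned sequences (a cytosine that still reads as C is taken as methylated, one that reads as T as unmethylated), a few of the claim 22 measurements are computed, and a classifier produces an output value. This is a minimal sketch and not the applicant's implementation. It assumes forward-strand reads aligned without gaps, and the reference sequence, the reads, the function names, and the logistic weights are all hypothetical stand-ins.

```python
import math

def cytosine_context(ref, pos):
    """Classify the cytosine at ref[pos] as 'CpG', 'CHG', 'CHH', or None."""
    if ref[pos] != "C":
        return None
    nxt = ref[pos + 1:pos + 3]  # up to two downstream reference bases
    if len(nxt) >= 1 and nxt[0] == "G":
        return "CpG"
    if len(nxt) == 2:
        return "CHG" if nxt[1] == "G" else "CHH"
    return None  # too close to the end of the reference to classify

def call_read(ref, read, start):
    """Return methylation calls per context (True = methylated) for one read.

    Assumes a forward-strand read aligned without gaps at 0-based offset
    `start`. Bases other than C or T at a cytosine position are skipped.
    """
    calls = {"CpG": [], "CHG": [], "CHH": []}
    for i, base in enumerate(read):
        ctx = cytosine_context(ref, start + i)
        if ctx and base in ("C", "T"):
            calls[ctx].append(base == "C")
    return calls

def methylation_profile(ref, reads):
    """Compute a few of the measurements named in claim 22 for one region."""
    per_fragment = []  # fraction of methylated CpGs per fragment
    cpg_meth = cpg_total = chh_meth = chh_total = 0
    for start, seq in reads:
        calls = call_read(ref, seq, start)
        cpg, chh = calls["CpG"], calls["CHH"]
        if cpg:
            per_fragment.append(sum(cpg) / len(cpg))
        cpg_meth += sum(cpg)
        cpg_total += len(cpg)
        chh_meth += sum(chh)
        chh_total += len(chh)
    return {
        "per_fragment_cpg_fraction": per_fragment,
        "region_cpg_fraction": cpg_meth / cpg_total if cpg_total else 0.0,
        # Conversion efficiency = 100 minus mean percent methylation of CHH.
        "conversion_efficiency": (100.0 - 100.0 * chh_meth / chh_total
                                  if chh_total else float("nan")),
    }

def classifier_output(features, weights, bias):
    """Logistic score in (0, 1); a stand-in for any trained classifier."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical reference region and three converted reads (start, sequence).
REF = "TACGATCGTCATTCGA"
READS = [
    (0, "TACGATCGTTATTCGA"),  # all three CpGs methylated
    (0, "TATGATTGTTATTTGA"),  # no CpG methylated
    (4, "ATCGTTATTTGA"),      # one of two covered CpGs methylated
]

profile = methylation_profile(REF, READS)
# Hypothetical weights; a real model would be trained on labeled samples.
score = classifier_output([profile["region_cpg_fraction"]], [4.0], -2.0)
```

In practice the reads would come from an aligner's output rather than hand-written strings, and the feature vector would span every region in the signature panel. The logistic function here simply stands in for whichever classifier type from claim 49 is used.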
CN202180039398.8A 2020-03-31 2021-03-29 Method and system for detecting colorectal cancer by nucleic acid methylation analysis Pending CN115667554A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063002878P 2020-03-31 2020-03-31
US63/002,878 2020-03-31
PCT/US2021/024604 WO2021202351A1 (en) 2020-03-31 2021-03-29 Methods and systems for detecting colorectal cancer via nucleic acid methylation analysis

Publications (1)

Publication Number Publication Date
CN115667554A (en) 2023-01-31

Family

ID=77929568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180039398.8A Pending CN115667554A (en) 2020-03-31 2021-03-29 Method and system for detecting colorectal cancer by nucleic acid methylation analysis

Country Status (8)

Country Link
US (2) US20230101485A1 (en)
EP (1) EP4127215A1 (en)
JP (1) JP2023524627A (en)
KR (1) KR20230017169A (en)
CN (1) CN115667554A (en)
AU (1) AU2021245992A1 (en)
CA (1) CA3178302A1 (en)
WO (1) WO2021202351A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497561A (en) * 2022-09-01 2022-12-20 北京吉因加医学检验实验室有限公司 Method and device for layering screening of methylation markers
CN116298295A (en) * 2023-05-18 2023-06-23 上海秤信生物科技有限公司 Tumor autoantigen/antibody combination for early detection of colorectal cancer and application thereof

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019060716A1 (en) 2017-09-25 2019-03-28 Freenome Holdings, Inc. Methods and systems for sample extraction
US11788152B2 (en) 2022-01-28 2023-10-17 Flagship Pioneering Innovations Vi, Llc Multiple-tiered screening and second analysis
WO2023164017A2 (en) * 2022-02-22 2023-08-31 Flagship Pioneering Innovations Vi, Llc Intra-individual analysis for presence of health conditions
WO2023225560A1 (en) 2022-05-17 2023-11-23 Guardant Health, Inc. Methods for identifying druggable targets and treating cancer
WO2024056008A1 (en) * 2022-09-16 2024-03-21 江苏鹍远生物科技股份有限公司 Methylation marker for identifying cancer and use thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8383338B2 (en) * 2006-04-24 2013-02-26 Roche Nimblegen, Inc. Methods and systems for uniform enrichment of genomic regions
WO2012174256A2 (en) * 2011-06-17 2012-12-20 The Regents Of The University Of Michigan Dna methylation profiles in cancer
CA2902916C (en) * 2013-03-14 2018-08-28 Mayo Foundation For Medical Education And Research Detecting neoplasm
WO2019195268A2 (en) * 2018-04-02 2019-10-10 Grail, Inc. Methylation markers and targeted methylation probe panels
CA3095056A1 (en) * 2018-04-13 2019-10-17 Freenome Holdings, Inc. Machine learning implementation for multi-analyte assay of biological samples

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115497561A (en) * 2022-09-01 2022-12-20 北京吉因加医学检验实验室有限公司 Method and device for layering screening of methylation markers
CN116298295A (en) * 2023-05-18 2023-06-23 上海秤信生物科技有限公司 Tumor autoantigen/antibody combination for early detection of colorectal cancer and application thereof
CN116298295B (en) * 2023-05-18 2023-09-01 上海秤信生物科技有限公司 Tumor autoantigen/antibody combination for early detection of colorectal cancer and application thereof

Also Published As

Publication number Publication date
US20230220492A1 (en) 2023-07-13
KR20230017169A (en) 2023-02-03
AU2021245992A1 (en) 2022-11-10
WO2021202351A1 (en) 2021-10-07
JP2023524627A (en) 2023-06-13
CA3178302A1 (en) 2021-10-07
US20230101485A1 (en) 2023-03-30
EP4127215A1 (en) 2023-02-08

Similar Documents

Publication Publication Date Title
JP7455757B2 (en) Machine learning implementation for multianalyte assay of biological samples
US20210230684A1 (en) Methods and systems for high-depth sequencing of methylated nucleic acid
JP7368483B2 (en) An integrated machine learning framework for estimating homologous recombination defects
US20230101485A1 (en) Methods and systems for detecting colorectal cancer via nucleic acid methylation analysis
US20230175058A1 (en) Methods and systems for abnormality detection in the patterns of nucleic acids
US20230160019A1 (en) Rna markers and methods for identifying colon cell proliferative disorders
US20240084397A1 (en) Methods and systems for detecting cancer via nucleic acid methylation analysis
WO2023183468A2 (en) Tcr/bcr profiling for cell-free nucleic acid detection of cancer
US20240055073A1 (en) Sample contamination detection of contaminated fragments with cpg-snp contamination markers
WO2022243566A1 (en) Dna methylation biomarkers for hepatocellular carcinoma
Luong Predicting Formalin-fixed Paraffin-embedded (FFPE) Sequencing Artefacts from Breast Cancer Exome Sequencing Data Using Machine Learning
WO2023161482A1 (en) Epigenetic biomarkers for the diagnosis of thyroid cancer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination