CN117535404A

CN117535404A - Multi-cancer methylation detection kit and application thereof

Info

Publication number: CN117535404A
Application number: CN202210914446.XA
Authority: CN
Inventors: 李冰思; 许佳悦; 邱福俊; 汉雨生; 张之宏
Original assignee: Guangzhou Burning Rock Dx Co ltd
Current assignee: Guangzhou Burning Rock Dx Co ltd
Priority date: 2022-08-01
Filing date: 2022-08-01
Publication date: 2024-02-09
Also published as: WO2024027591A1

Abstract

Provided are a multi-cancer methylation detection kit and an application thereof. Specifically, provided is a biomarker combination for evaluating the correlation of a test sample with the risk of tumor formation and/or the tumor tissue origin, wherein the reference genome version to which the differentially methylated regions (DMRs) refer is hg19. Also provided is the use of a reagent for detecting the biomarker combination in the preparation of a kit for diagnosing the risk of tumor formation and/or assessing the tumor tissue origin of a sample. The method is suitable for risk assessment and tissue-of-origin tracing of various cancers, and has the advantages of low cost and high accuracy.

Description

Multi-cancer methylation detection kit and application thereof

Technical Field

The application relates to the biomedical field, in particular to a multi-cancer methylation detection kit and application thereof.

Background

DNA methylation is known to play an important role in the regulation of gene expression. Abnormal DNA methylation signatures have been reported in the course of many diseases, including cancer. DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis, and monitoring. Whole genome bisulfite sequencing (WGBS) is the gold standard for methylation sequencing, but is difficult to use clinically because of the severe damage to DNA during processing and the excessive sequencing cost. More importantly, most regions of the human genome are not active during the development of cancer, and cancer-related variations tend to be concentrated in certain specific regions, such as CpG islands, which provides a good opportunity for targeted sequencing.

Nevertheless, the discovery and screening of cancer-associated differentially methylated regions (DMRs) is challenging, because population heterogeneity, including conditions such as disease and age, causes non-specific changes in methylation profiles; these non-cancerous but abnormal signals therefore need to be dealt with when building the cancer assessment (DOC) model. Finally, for applications that detect multiple cancer types, establishing a tissue traceability (TOO) model is of important auxiliary value for tracing the possible source organ of a cancer variation, determining the downstream diagnosis and treatment path, and saving medical costs.

Disclosure of Invention

The application establishes a low-cost and high-precision method, adopts DNA or RNA oligonucleotide sequences to capture methylation variation regions of various cancers and specific methylation characteristic regions of various organs, judges the existence of tumor components (ctDNA) in blood free DNA (cfDNA), and evaluates the tissue sources of the tumor components (ctDNA).

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, and wherein the reference genome version to which the DMRs in the table refer is hg19.

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the tumor tissue origin, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, and wherein the reference genome version to which the DMRs in the table refer is hg19.

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation and/or the tumor tissue origin, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, and wherein the reference genome version to which the DMRs in the table refer is hg19.

In one aspect, the present application provides a kit comprising a biomarker combination as described herein, and optionally comprising a second generation high throughput sequencing reagent.

In one aspect, the present application provides the use of a reagent for detecting a biomarker combination described herein in the preparation of a kit for diagnosing risk of neoplasia and/or tumour tissue origin.

In one aspect, the present application provides a method of assessing the correlation of a test sample with the risk of tumor formation and/or the tumor tissue origin, the method comprising: detecting, in the test sample, the methylation levels of a biomarker combination as described herein.

In one aspect, the present application provides a storage medium that records a program that can run the methods described herein.

In one aspect, the present application provides an apparatus comprising a storage medium as described herein, and optionally comprising a processor coupled to the storage medium, the processor configured to execute based on a program stored in the storage medium to implement the methods described herein.

The biomarker combination, the kit, the method, the equipment, the storage medium and the application can be suitable for risk assessment and tissue tracing of various cancers, and have the advantages of low cost and high accuracy.

Other aspects and advantages of the present application will become readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the present disclosure enables one skilled in the art to make modifications to the disclosed embodiments without departing from the spirit and scope of the invention as described herein. Accordingly, the drawings and descriptions herein are to be regarded as illustrative in nature and not as restrictive.

Drawings

The specific features of the invention related to this application are set forth in the appended claims. The features and advantages of the invention that are related to the present application will be better understood by reference to the exemplary embodiments and the drawings that are described in detail below. The drawings are briefly described as follows:

FIG. 1 shows an exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).

FIG. 2 shows another exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).

Figures 3A-3C show another exemplary scenario (a theoretical exemplary presentation is not intended to represent an actual sequencing scenario).

FIG. 4 shows that in 5-fold cross-validation, a 98% (95% CI: 96-99%) tissue traceability accuracy can be achieved.

FIG. 5 shows the control results of the weighting configuration of the Salmon-DOC model of the present application for confounding relevant features.

FIG. 6 shows that the Salmon-DOC model of the present application can efficiently detect 6 cancer types at different stages in the tumor cohort.

FIG. 7 shows that the Salmon-DOC model of the present application overcomes the weakness of earlier methylation models, in which false positives in healthy populations increased with age, and remains balanced across all age groups (the horizontal axis is age and the vertical axis is the cancer probability score of the model).

Figures 8A-8D show that the traceability accuracy of the two-layer Salmon-TOO model of the present application is superior to that of the single-layer model in both cross-validation and independent validation.

FIG. 9 shows the tissue traceability evaluation result obtained from the 103 TOO-related DMRs.

Detailed Description

Further advantages and effects of the invention of the present application will become apparent to those skilled in the art from the disclosure of the present application, from the following description of specific embodiments.

Definition of terms

In the present application, the term "differential methylation region" (DMR) generally refers to a region of DNA comprising one or more differential methylation sites. For example, a DMR that includes a greater number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypermethylated DMR. For example, a DMR that includes a lesser number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypomethylated DMR.

In this application, the terms "second generation gene sequencing (NGS)", "high-throughput sequencing" or "next generation sequencing" generally refer to second generation high-throughput sequencing techniques and the higher-throughput sequencing methods developed thereafter. Next generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. With the continued development of sequencing technology, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. For example, second generation gene sequencing may have the advantages of high sensitivity, large throughput, high sequencing depth, or low cost. According to development history, influence, sequencing principle and technical differences, the main methods are: massively parallel signature sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing method of Complete Genomics, and the like. Second generation gene sequencing enables detailed, comprehensive analysis of the transcriptome and genome of a species, and is therefore also referred to as deep sequencing. For example, the methods of the present application can equally be applied to first generation gene sequencing, second generation gene sequencing, third generation gene sequencing, or single molecule sequencing (SMS).

In this application, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected.

The terms "polynucleotide", "nucleotide", "nucleic acid" and "oligonucleotide" are used interchangeably herein. They represent polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogues thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined according to linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short-hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.

In the present application, the term "methylation" generally refers to the methylation state of a gene fragment, a nucleotide or a base thereof in the present application. For example, a DNA fragment in which a gene is located in the present application may have methylation on one or more strands. For example, a DNA fragment in which a gene is located in the present application may have methylation at one site or DMR or multiple sites or DMR.

In this application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome may have different versions, for example, hg19, GRCh37 or Ensembl 75.

In this application, the term "machine learning model" generally refers to a collection of system or program instructions and/or data configured to implement an algorithm, process, or mathematical model. In this application, the algorithm, process, or mathematical model may evaluate and provide a desired output based on a given input. In this application, the parameters of the machine learning model may not be explicitly programmed, and in a conventional sense, the machine learning model may not be explicitly designed to follow specific rules in order to provide the desired output for a given input. For example, the use of the machine learning model may mean that the machine learning model and/or the data structure/set of rules as the machine learning model are trained by a machine learning algorithm.

In this application, the term "comprising" is generally intended to include the features specifically recited, but does not exclude other elements.

In this application, the term "about" generally means ranging from 0.5% to 10% above or below the specified value, e.g., ranging from 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.

In order to detect 6 cancer types with high incidence and high mortality, such as lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer and esophageal cancer, the present application combines mining of a public database (TCGA) with mining of internal data, and uses a novel algorithm that compares methylation variation and genomic spatial position simultaneously; in total, 2536 differentially methylated regions (DMRs) highly related to cancer were screened out.

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, wherein the reference genome version to which the DMRs in the table refer is hg19.

For example, the biomarker combination comprises 94 DMRs in table 1A. For example, the biomarker combinations comprise about 94 DMR, any at least about 90 DMR, any at least about 80 DMR, any at least about 70 DMR, any at least about 60 DMR, any at least about 50 DMR, any at least about 40 DMR, any at least about 30 DMR, any at least about 20 DMR, or any at least about 10 DMR in table 1A.

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the tumor tissue origin, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, wherein the reference genome version to which the DMRs in the table refer is hg19.

For example, the biomarker combination comprises 103 DMR in table 1B. For example, the biomarker combinations comprise about 103 DMR, any at least about 100 DMR, any at least about 90 DMR, any at least about 80 DMR, any at least about 70 DMR, any at least about 60 DMR, any at least about 50 DMR, any at least about 40 DMR, any at least about 30 DMR, any at least about 20 DMR, or any at least about 10 DMR in table 1B.

In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation and/or the tumor tissue origin, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, wherein the reference genome version to which the DMRs in the table refer is hg19.

For example, the biomarker combination comprises any at least 222 DMR in table 1E. For example, the biomarker combinations comprise about 222 DMRs, any at least about 220 DMRs, any at least about 210 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in table 1E.

For example, the biomarker combination comprises 488 DMRs in table 1D. For example, the biomarker combination comprises about 488 DMRs, any at least about 480 DMRs, any at least about 450 DMRs, any at least about 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in table 1D.

For example, the biomarker combination comprises 860 DMRs in table 1C. For example, the biomarker combination comprises about 860 DMRs, any at least about 850 DMRs, any at least about 800 DMRs, any at least about 700 DMRs, any at least about 600 DMRs, any at least about 500 DMRs, any at least about 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in table 1C.

For example, the tumor is from a homogeneous tumor, a heterogeneous tumor, a hematological cancer, and/or a solid tumor. For example, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.

In one aspect, the present application provides a kit comprising a biomarker combination as described herein, and optionally comprising a second generation high throughput sequencing reagent. For example, the kit can be used to assess the correlation of a test sample with the risk of neoplasia and/or the origin of the tumor tissue.

In one aspect, the present application provides the use of a reagent for detecting a biomarker combination described herein in the preparation of a kit for diagnosing the risk of tumor formation and/or the tumor tissue origin. For example, the tumor is from a homogeneous tumor, a heterogeneous tumor, a hematological cancer, and/or a solid tumor. For example, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.

In one aspect, the present application provides an assessment method for assessing the correlation of a test sample with the risk of tumor formation and/or the tumor tissue origin of the sample, the method comprising: detecting, in the test sample, the methylation levels of a biomarker combination as described herein.

For example, the sample is selected from the group consisting of: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

In one aspect, the present application provides a storage medium that records a program that can run the methods described herein. For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (e.g., a solid-state drive (SSD), a solid-state card (SSC), or a solid-state module (SSM)), an enterprise-level flash drive, a tape, or any other non-transitory magnetic medium, etc. The non-volatile computer-readable storage medium may also include punch cards, paper tape, optical discs (or any other physical medium having a hole pattern or other optically recognizable indicia), compact disc read-only memory (CD-ROM), rewritable compact discs (CD-RW), digital versatile discs (DVD), Blu-ray discs (BD), and/or any other non-transitory optical medium.

Without intending to be limited by any theory, the following examples are presented merely to illustrate the kits, methods, uses, etc. of the present application and are not intended to limit the scope of the invention of the present application.

Examples

Example 1

For example, samples are bisulfite-treated and subjected to second generation sequencing; the resulting sequencing data contain the methylation level and sequencing coverage depth of each CpG methylation site. Optionally, noise removal is performed using the genomic methylation signal at CpG sites and the noise-region CHH/CHG sites. Then, for the "tumor" (C) and "normal" (N) groups, a weighted logistic regression is computed to obtain a p-value: the explanatory variable of the logistic regression is a continuous variable, i.e., the methylation level at each CpG site, and the response variable is a binary output, i.e., (0, 1), corresponding to C and N. The weighted logistic regression examines the difference between C and N at each CpG site, under the null hypothesis that the difference between C and N at that CpG site is not statistically significant.
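The per-site test described above could be sketched as follows. This is only an illustrative sketch under stated assumptions: the text does not say what the regression weights are or which test produces the p-value, so the use of sequencing coverage depth as the per-sample weight, the Newton-Raphson fit and the Wald test on the slope are all assumptions, and the function name `weighted_logistic_pvalue` is hypothetical.

```python
import math

def _sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def weighted_logistic_pvalue(levels, labels, weights, iterations=50, tol=1e-10):
    """Two-sided Wald p-value for the slope of a one-predictor weighted
    logistic regression fitted at a single CpG site.

    levels  : methylation level of the site in each sample (0..1)
    labels  : 0 for the tumor group (C), 1 for the normal group (N)
    weights : per-sample weights (ASSUMPTION: sequencing coverage depth)
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y, w in zip(levels, labels, weights):
            p = _sigmoid(b0 + b1 * x)
            r = w * (y - p)          # weighted residual (gradient term)
            v = w * p * (1.0 - p)    # weighted variance (information term)
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("information matrix is singular")
        # Newton step: solve H * step = g for the 2x2 information matrix H.
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    # Slope variance is the (1, 1) entry of the inverse information matrix.
    z = b1 / math.sqrt(h00 / det)
    return math.erfc(abs(z) / math.sqrt(2.0))

if __name__ == "__main__":
    levels = [0.8, 0.7, 0.9, 0.6, 0.3, 0.1, 0.2, 0.3, 0.15, 0.6]
    labels = [0] * 5 + [1] * 5
    depths = [300, 500, 400, 200, 350, 450, 300, 250, 500, 400]
    print(weighted_logistic_pvalue(levels, labels, depths))
```

A small p-value rejects the null hypothesis that C and N do not differ at the site. Treating raw depth as a frequency weight makes the test very sensitive, so a real pipeline would probably normalize the weights.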

DMR partitioning

Based on the methylation level and sequencing coverage depth of the methylation site CpG, it was determined how the DMR individual regions were divided. Specifically, the methylation level and sequencing coverage depth of the methylation site CpG were calculated according to the following formula:

where $d_{ij}$ is the effective coverage depth of the j-th site of the i-th sample of group C, and $M_{ij}$ is the methylation level at the j-th site of the i-th sample of group C; the formula evaluates the similarity in methylation level at spatially consecutive sites of the genome. The deeper the coverage, the greater the value of parameter P, and the closer the methylation levels of adjacent CpG sites within the same group are taken to be.

For the first CpG site in the region, sample A and sample B each obtained a coverage of 500 effective sequences, and sample C obtained a coverage of 200 effective sequences. For sample A, the methylation level of this CpG site was 0.2, and the methylation level at the second CpG site was 0. The coverage depth parameter value P for the first CpG site of the group, calculated over the three samples, was 0.617. In this case, $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.617)} = 0.29$. Since one of the requirements for dividing two adjacent sites into the same DMR is that this difference in methylation level between the two CpG sites be less than 0.25, the first and second CpG sites in this example will not be divided into the same DMR.

If the samples are replaced with A, B and D (where sample D obtained a coverage of 400 effective sequences at the first CpG site), then likewise, for sample A, the methylation level of this CpG site is 0.2 and the methylation level at the second CpG site is 0. However, owing to the higher sequencing coverage of sample D in this example, the coverage depth parameter value P for the first CpG site of the group, calculated over the three samples, is 0.962. In this case, $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.962)} = 0.21$, which is smaller than the threshold of 0.25 for division into the same DMR, so according to sample A the first and second CpG sites in this example meet the precondition for being divided into the same DMR.

Therefore, the coverage depth of the CpG sites is introduced by the method, so that the accuracy of DMR region division can be remarkably improved.
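The arithmetic of the two worked examples can be reproduced with a short sketch. It takes the coverage depth parameter P as a given input, because the formula that derives P from the coverage depths is not reproduced in this text; the function names `beta` and `may_share_dmr` are hypothetical.

```python
import math

SAME_DMR_THRESHOLD = 0.25  # threshold quoted in the worked examples

def beta(level_first, level_second, p):
    """Depth-adjusted methylation difference between two adjacent CpG sites.

    p is the coverage depth parameter P, supplied by the caller.
    """
    return abs(level_first - level_second) * math.exp(1.0 - p)

def may_share_dmr(level_first, level_second, p, threshold=SAME_DMR_THRESHOLD):
    """True if the two sites satisfy this one precondition for sharing a DMR."""
    return beta(level_first, level_second, p) < threshold

if __name__ == "__main__":
    for p in (0.617, 0.962):  # samples A, B, C and samples A, B, D
        print(p, round(beta(0.2, 0.0, p), 2), may_share_dmr(0.2, 0.0, p))
```

The same raw difference of 0.2 falls on either side of the threshold depending only on P, which is the effect the example is meant to illustrate.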

Further optionally, $B_{ij}$ in a region is calculated as follows:

Figures 3A-3C show another exemplary scenario (a theoretical exemplary presentation, not intended to represent an actual sequencing scenario). When the DMR region contains 10 CpG sites, the $B_{ij}$ values of all samples are combined, and the score of each DMR is calculated by averaging.

The calculation steps of the B values in the DMR region shown in group A are shown in the following table:

The B value score is 0.1.

Similarly, the B value score within the DMR shown in group B is 0.7, and the B value score within the DMR shown in group C is 1.233.

The DMR region screened by the method not only contains the cancer variation information of various cancer species, but also contains the tissue-specific characteristics, and has better segmentation effect at the region boundary.

Example 2

Cancer assessment (DOC) model establishment

The ctDNA content in blood varies greatly between cancers and between stages of development, and is susceptible to experimental batch effects. Furthermore, methylation changes with age, disease, ethnicity, etc.; if left uncorrected, these factors act as confounding variables and may affect the accuracy of the classification model. The method first quantifies the bias introduced by the confounding variables (the quantification may use, but is not limited to, the Hilbert-Schmidt independence criterion), and then embeds it in the regularization term of the model for correction, increasing the accuracy and generalization ability of the model.

Algorithm establishment

Assume m samples with feature vectors $X = (x_1, \dots, x_m)$, classification labels $Y = (y_1, \dots, y_m)$ and confounding variables $Z = (z_1, \dots, z_m)$, where $x_i$ is an n-dimensional vector representing the methylation signature of sample i, $y_i$ is the classification label of $x_i$ with $y_i \in \{-1, +1\}$, and $z_i$ is a confounding variable of sample i.

Here $L_H$ is the Hilbert-Schmidt independence criterion, used to measure the degree of independence of the variables X and Z; h(x) and h(z) are the kernel functions of X and Z; $P_{h(x)h(z)}$ represents the probability distribution of h(x) and h(z); F and G represent the reproducing kernel Hilbert spaces of X and Z respectively, which can be understood as the domains onto which X and Z are mapped after non-linear processing; $C_{h(x)h(z)}$ is the correlation coefficient of these two kernel functions; and HS refers to Hilbert space.
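To make the independence measure concrete, the following sketch computes the commonly used biased empirical estimate of the Hilbert-Schmidt independence criterion, tr(KHLH)/(m-1)^2, where K and L are kernel Gram matrices and H is the centering matrix. It is not taken from the application: the Gaussian kernel, the bandwidth `sigma`, the restriction to scalar (one-dimensional) variables and the function names are all assumptions made for illustration.

```python
import math

def gaussian_gram(values, sigma=1.0):
    """Gram matrix of a Gaussian kernel over scalar values (ASSUMED kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
            for a in values]

def double_center(gram):
    """Return H * gram * H, where H = I - (1/m) * ones is the centering matrix."""
    m = len(gram)
    row_means = [sum(row) / m for row in gram]
    col_means = [sum(gram[i][j] for i in range(m)) / m for j in range(m)]
    grand_mean = sum(row_means) / m
    return [[gram[i][j] - row_means[i] - col_means[j] + grand_mean
             for j in range(m)] for i in range(m)]

def hsic(xs, zs, sigma=1.0):
    """Biased empirical HSIC between two equally long lists of scalars."""
    m = len(xs)
    k_centered = double_center(gaussian_gram(xs, sigma))
    l_gram = gaussian_gram(zs, sigma)
    # tr(K H L H) equals tr((H K H) L) by the cyclic property of the trace.
    trace = sum(k_centered[i][j] * l_gram[j][i]
                for i in range(m) for j in range(m))
    return trace / (m - 1) ** 2

if __name__ == "__main__":
    feature = [0.1, 0.4, 0.35, 0.8, 0.9]      # hypothetical methylation feature
    age = [35.0, 52.0, 47.0, 66.0, 71.0]      # hypothetical confounder
    print(hsic(feature, age, sigma=10.0))
```

A value near zero indicates that the feature carries little information about the confounder; larger values indicate stronger dependence, which is the quantity a confounder-aware regularization term would penalize.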

A support vector machine (SVM) is used as the main classifier:

$f(x; w, b) = \operatorname{sgn}(w^{T}x + b)$

$\operatorname{sgn}(a) = 1$ if $a \ge 0$, and $\operatorname{sgn}(a) = -1$ if $a < 0$

The classification boundary is determined by solving the following objective equation,

For non-separable data, the soft-margin support vector machine (soft-margin SVM) introduces a penalty for training errors

where C controls the balance between minimizing training errors and maximizing the classification margin, and $\zeta_i$ refers to the degree to which sample $x_i$ violates the constraint.
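For reference, the textbook hard-margin and soft-margin SVM problems, which match the description of C and $\zeta_i$ given here, are conventionally written as below. This is the standard formulation and is assumed, not confirmed, to correspond to the objective equations of the application, which are not reproduced in this text.

```latex
% Hard-margin SVM (separable data)
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1,\qquad i = 1,\dots,m

% Soft-margin SVM (non-separable data)
\min_{w,\,b,\,\zeta}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\zeta_i
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \zeta_i,\quad \zeta_i \ge 0,\qquad i = 1,\dots,m
```

A larger C penalizes margin violations more heavily and gives a narrower margin; a smaller C tolerates more violations in exchange for a wider margin.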

Salmon adds a regularization term for confounding-factor control to the objective equation solved by the SVM; the parameter $\lambda$ controls the balance between the confounding-factor error and the maximized margin width during training. The objective equation is

Here C and $\lambda$ control the balance among minimizing training errors, minimizing the correlation of the confounding variables with the explanatory variables, and maximizing the classification margin.

Each data point represents a blood sample used for Salmon-DOC model construction; the horizontal axis is the confounding factor of the corresponding sample, and the vertical axis is the original uncorrected variable coefficient (panel A) or the corrected variable coefficient (panel B). Comparison before and after correction shows that the weights of confounder-related features are controlled in Salmon-DOC.

Retrospective cohort data

The application uses retrospective clinical samples of 6 cancer types, divided into a training set and a validation set, and evaluates the accuracy of the Salmon binary classifier (cancer vs. non-cancer).

Example 3

Tissue Traceability (TOO) model building

First layer TOO model construction

The TOO model is essentially a multi-classification problem. The probability calculation for each class can be reduced to voting on the results of pairwise binary classifications and then choosing the class with the most votes. However, for possible clinical applications of the tissue traceability model, producing a single classification result is not enough; only when classification probabilities are produced does ensembling of models become possible.

The first step in the Salmon-TOO model of the present application is therefore to quantify the outcome of the classification voting. This quantification can be expressed through probability calculations. Given a data point x and a label y, assume a pairwise classification probability $\mu_{ij}$. A model can be obtained from the i-th and j-th categories of the training set, and for any new data point x the calculated $r_{ij}$ can be used as an approximate estimate of $\mu_{ij}$. The problem then reduces to using all $r_{ij}$ to estimate the probability of the i-th category:

$p_i = P(y = i \mid x), \quad i = 1, \dots, k$

Define $r_{ij}$ as the estimate of $\mu_{ij}$, and let $\mu_{ij} + \mu_{ji} = 1$. A "voting" system is used for the multi-classification problem, where

$\mu_{ij} \equiv P(y = i \mid y = i \text{ or } j,\ x)$

Define I as the indicator function: $I_{\{x\}} = 1$ if x is true, and 0 otherwise. The probability calculation can be written as
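One common voting-based estimate from the pairwise-coupling literature, consistent with the definitions of $r_{ij}$ and the indicator function above, is $p_i = 2\sum_{j \ne i} I_{\{r_{ij} > r_{ji}\}} / (k(k-1))$. The sketch below implements that estimate; whether it matches the formula of the application, which is not reproduced in this text, is an assumption, and the function name and example matrix are hypothetical.

```python
def voting_probabilities(r):
    """Turn pairwise estimates r[i][j] ~ P(y = i | y = i or j, x) into
    per-class probabilities by counting pairwise wins.

    r is a k x k list of lists; diagonal entries are ignored.
    """
    k = len(r)
    pair_count = k * (k - 1) / 2.0
    votes = [sum(1 for j in range(k) if j != i and r[i][j] > r[j][i])
             for i in range(k)]
    return [v / pair_count for v in votes]

if __name__ == "__main__":
    # Hypothetical pairwise estimates for three tissue classes.
    r = [
        [0.0, 0.7, 0.6],
        [0.3, 0.0, 0.8],
        [0.4, 0.2, 0.0],
    ]
    print(voting_probabilities(r))
```

Exact ties (r[i][j] equal to r[j][i]) give a vote to neither class in this sketch, so the probabilities then sum to less than 1.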

Second layer TOO model construction

The second layer of the Salmon-TOO model is MLR fitting for different classes (classes)

Assuming that a probability calculation is required for the source of the seed tissue, a quantized classification probability may be obtained from the first layer, the value range is (+_infinity, - +_infinity). Because the actual distribution of each pair of the classification probabilities is inconsistent, the quantized classification probabilities can be further used as interpretation variables of logistic regression, and the reaction variables adopt multiple outputs corresponding to the known tissue sources in the modeling process.

As shown in the above table, each column represents a characteristic variable of the logistic regressionI.e., two-class assessment probabilities for two-by-two tissue classes; each row represents a reaction variable y ₁ I.e., tissue class (class).

For the feature variables used to interpret the two-class probabilities, the evaluation result is converted into Y assuming that there are J discontinuous reflection variables in total _i1 ，…，Y _iJ ，β _j For feature weights based on each of the reflected variables.
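As an illustration of how first-layer pairwise scores could be mapped to J tissue-class probabilities with per-class weights $\beta_j$, the sketch below uses the standard softmax form of multinomial logistic regression. Reading "MLR" as multinomial logistic regression is an assumption, as are the function name, the intercepts, and the example weights, which are made up and not fitted to any data.

```python
import math

def mlr_probabilities(features, class_weights, intercepts):
    """Softmax class probabilities from one feature vector.

    features      : first-layer pairwise scores for one sample
    class_weights : one weight vector (beta_j) per tissue class
    intercepts    : one intercept per tissue class
    """
    scores = [b0 + sum(w * f for w, f in zip(beta_j, features))
              for beta_j, b0 in zip(class_weights, intercepts)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Hypothetical scores for the pairs (0 vs 1), (0 vs 2), (1 vs 2).
    pairwise_scores = [1.2, 0.4, -0.9]
    weights = [
        [1.0, 1.0, 0.0],     # class 0 gains from winning pairs (0,1) and (0,2)
        [-1.0, 0.0, 1.0],    # class 1
        [0.0, -1.0, -1.0],   # class 2
    ]
    probs = mlr_probabilities(pairwise_scores, weights, [0.0, 0.0, 0.0])
    print(probs, probs.index(max(probs)))
```

In practice the weights would be estimated from the training set with the known tissue sources as the response, and any class re-weighting would enter at that fitting step.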

In the Salmon-DOC model, a sample may be judged negative for some cancer types and positive for others. To account for this, when the traceability model is built, the tissue classes are weight-corrected using a quasi-maximum likelihood estimation method. Taking binary logistic regression as an example, this can be expressed as:

Retrospective cohort data

All data of the retrospective cohort were randomized 1:1 into a training set and a validation set. First, cross-validation was carried out on the training set to obtain traceability evaluation results; the model parameters were continuously optimized during this process and finally locked. The locked model was then used to evaluate the traceability results for all data in the validation set. In the traceability model training set, the total sample size across the six cancers is 300, and the numbers for each stage of each cancer are relatively balanced: 36 cases of lung cancer (stages I-IV: 4/12/5/15), 62 cases of intestinal cancer (stages I-IV: 8/18/18/18), 74 cases of liver cancer (stages I-IV: 25/14/22/13), 48 cases of ovarian cancer (stages I-IV: 1/4/38/5), 40 cases of pancreatic cancer (stages I-IV: 3/6/13/18), and 42 cases of esophageal cancer (stages I-IV: 5/10/15/12). The traceability model validation set comprises a total of 224 samples: 31 cases of lung cancer (stages I-IV: 4/5/12/10), 52 cases of intestinal cancer (stages I-IV: 7/15/13/17), 55 cases of liver cancer (stages I-IV: 17/11/20/7), 27 cases of ovarian cancer (stages I-IV: 3/4/8/12), 25 cases of pancreatic cancer (stages I-IV: 4/6/6/9), and 34 cases of esophageal cancer (stages I-IV: 4/7/8/15).

FIG. 8 shows that the traceability accuracy of the Salmon-TOO double-layer model of the present application is superior to that of the single-layer model in both cross-validation and independent validation.

Panels A and B show the traceability evaluation results from cross-validation on the six-cancer training set. Panel A is the output when only the first-layer TOO model is built: the traceability accuracy is 0.87 (260/300), or 0.93 (279/300) if the suboptimal (second-ranked) traceability result is also counted. Panel B is the output when the second-layer MLR model is added on top of the first-layer TOO model: the traceability accuracy improves to 0.90 (270/300), or 0.95 (284/300) if the suboptimal result is also counted. Similarly, panels C and D show the traceability evaluation results from independent validation on the six-cancer validation set. Panel C is the output of the first-layer TOO model alone: the traceability accuracy is 0.77 (173/224), or 0.87 (194/224) if the suboptimal result is also counted. Panel D is the output with the second-layer MLR model added: the traceability accuracy improves to 0.84 (187/224), or 0.89 (199/224) if the suboptimal result is also counted.
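The two accuracy figures reported for each panel (best call only, and best call plus the suboptimal call) can be computed as in the following minimal sketch. The function name `top_n_accuracy` and the toy predictions are hypothetical and are not taken from the cohort data.

```python
def top_n_accuracy(ranked_predictions, true_labels, n=1):
    """Fraction of samples whose true tissue is among the n highest-ranked calls."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:n])
    return hits / len(true_labels)


# Hypothetical ranked calls (best first) for four samples.
ranked = [["lung", "liver"],
          ["liver", "ovarian"],
          ["pancreatic", "lung"],
          ["ovarian", "liver"]]
truth = ["lung", "ovarian", "esophageal", "ovarian"]

top1 = top_n_accuracy(ranked, truth, n=1)   # best call only
top2 = top_n_accuracy(ranked, truth, n=2)   # best or suboptimal call
```

In this toy example two of four samples are correct on the best call, and a third is recovered when the suboptimal call is also counted.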

In conclusion, the evaluation accuracy of the Salmon-TOO double-layer traceability model is better than that of the single-layer model in both cross-validation on the training set and independent validation.

Example 4

DOC cancer detection model

Table 1A shows 94 DMR regions for DOC cancer detection model


Based on the 94 DOC-related DMR regions, 100 healthy samples and 318 positive samples from six cancer types in independent validation set 1 were evaluated, giving an overall sensitivity of 80.5% (256/318) at an overall specificity of 95% (95/100). At the 90% specificity level, the sensitivities by cancer type and stage are as follows:

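Reporting sensitivity at a fixed specificity level can be sketched as follows: a score cutoff is chosen from the healthy samples so that the target fraction of them falls at or below it, and the sensitivity is the fraction of cancer samples scoring above that cutoff. This is a minimal sketch under assumptions; the function names and all scores are hypothetical, and the actual scoring rule of the DOC model is not reproduced here.

```python
import math


def threshold_at_specificity(healthy_scores, target_specificity):
    """Smallest cutoff such that at least the target fraction of healthy
    scores fall at or below it (a sample is positive when score > cutoff)."""
    ordered = sorted(healthy_scores)
    n_negative_needed = max(1, math.ceil(target_specificity * len(ordered)))
    return ordered[n_negative_needed - 1]


def sensitivity(cancer_scores, cutoff):
    """Fraction of cancer samples called positive at the given cutoff."""
    return sum(s > cutoff for s in cancer_scores) / len(cancer_scores)


# Hypothetical model scores.
healthy = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.80]
cancer = [0.30, 0.50, 0.60, 0.70, 0.90]

cutoff = threshold_at_specificity(healthy, 0.90)
sens = sensitivity(cancer, cutoff)
spec = sum(s <= cutoff for s in healthy) / len(healthy)
```

Fixing the cutoff on healthy samples first makes sensitivities comparable across cancer types and stages, because every subgroup is evaluated at the same false-positive rate.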

Repeated tests were then performed, each using a random 50 of the 94 DOC regions. The sensitivities for the six-cancer positive samples in the five replicates, at the 90% (90/100) specificity level, are shown in the following table:
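The repeated random-subset procedure can be sketched as below. The region names are placeholders rather than the entries of Table 1A, and the function name `random_region_subsets` and the fixed seed are assumptions added for reproducibility.

```python
import random


def random_region_subsets(regions, subset_size, n_rounds, seed=0):
    """Draw n_rounds independent random subsets, each without replacement."""
    rng = random.Random(seed)
    return [rng.sample(regions, subset_size) for _ in range(n_rounds)]


# 94 placeholder region names standing in for the DOC regions.
regions = [f"DMR_{i:03d}" for i in range(1, 95)]
subsets = random_region_subsets(regions, subset_size=50, n_rounds=5)
```

Each subset would then be passed through the same evaluation at the fixed specificity level, giving one sensitivity estimate per replicate.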

Example 5

TOO organization traceability model

Table 1B shows 103 DMR regions for TOO organization traceability model


Based on the 103 TOO-related DMR regions, traceability evaluation was performed on 473 positive samples from six cancer types in independent validation set 2. The top-ranked traceability accuracy was 63.0% (298/473); if the suboptimal traceability result is also counted, the accuracy rises to 71.5% (338/473).

Four rounds of repeated testing were then performed, each using a random 50 of the 103 TOO regions. The traceability accuracies in the four rounds are shown in the following table:

Example 6

DMRs for simultaneous evaluation of DOC and TOO:

Table 1C shows 860 DMR regions for DOC and TOO evaluation models


Table 1D shows 488 DMR regions for DOC and TOO assessment models


Table 1E shows 222 DMR regions for DOC and TOO evaluation models


In independent validation set 3, comprising 473 negative samples and 473 positive samples from six cancer types, the sensitivity and traceability accuracy were calculated at a uniform specificity of 95.1% (450/473) as the number of markers was progressively reduced. The resulting tumor detection and tissue traceability results are shown in the following table:


The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently exemplified embodiments of the present application will be apparent to those of ordinary skill in the art and remain within the scope of the appended claims and equivalents thereof.

Claims

1. A biomarker combination for assessing the correlation of a test sample with risk of neoplasia, the biomarker combination comprising any of at least 10 differential methylation regions DMR as set forth in table 1A, wherein the reference gene version referred to by the DMR in the table is the hg19 version.

2. The biomarker combination according to claim 1, comprising any of at least 50 DMR in table 1A.

3. The biomarker combination according to any of claims 1 to 2, comprising 94 DMR in table 1A.

4. A biomarker combination for assessing the relatedness of a sample to be tested to a source of tumour tissue, said biomarker combination comprising any of at least 10 differential methylation regions DMR as shown in table 1B, wherein the reference gene version referred to by the DMR in the table is the hg19 version.

5. The biomarker combination according to claim 4, comprising any of at least 50 DMR in table 1B.

6. The biomarker combination according to any of claims 4 to 5, comprising 103 DMR in table 1B.

7. A biomarker combination for assessing the relatedness of a test sample to a tumor formation risk and/or a tumor tissue source, characterized in that said biomarker combination comprises any of at least 10 differential methylation regions DMR as shown in table 1C, wherein the reference gene version referred to by the DMR in the table is the hg19 version.

8. The biomarker combination of claim 7, comprising at least 50 DMR of any of table 1E, table 1D or table 1C.

9. The biomarker combination according to any of claims 7 to 8, comprising 222 DMR in table 1E.

10. The biomarker combination according to any of claims 7 to 9, comprising 488 DMRs in table 1D.

11. The biomarker combination according to any of claims 7 to 10, comprising 860 DMR in table 1C.

12. The biomarker combination according to any of claims 1 to 11, wherein the tumour is derived from a homogeneous tumor, a heterogeneous tumor, a hematological cancer and/or a solid tumor; preferably, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer.

13. The biomarker combination according to any of claims 1 to 12, wherein the tumour comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or oesophageal cancer.

14. A kit comprising the biomarker combination of any of claims 1-13, and optionally comprising a second generation high throughput sequencing reagent.

15. The kit of claim 14 for assessing the correlation of a test sample with the risk of tumor formation and/or the origin of tumor tissue.

16. Use of a reagent for detecting a biomarker combination according to any of claims 1 to 13, in the manufacture of a kit for diagnosing the risk of tumour formation and/or tumour tissue origin.

17. The use of claim 16, wherein the tumor is derived from a homogeneous tumor, a heterogeneous tumor, a hematological cancer and/or a solid tumor; preferably, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer.

18. The use of any one of claims 16-17, wherein the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.

19. A method of assessing the correlation of a test sample with the risk of tumour formation and/or tumour tissue origin, the method comprising: detecting the methylation level of a biomarker combination comprising a biomarker combination according to any of claims 1 to 13 in a sample to be tested.

20. The assessment method of claim 19, wherein the sample is selected from the group consisting of: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

21. A storage medium carrying a program operable to perform the method of any one of claims 19 to 20.

22. An apparatus comprising the storage medium of claim 21, and optionally comprising a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium to implement the method of any of claims 19-20.