CN118240934A

CN118240934A - Methylation signal detection method, device and kit

Info

Publication number: CN118240934A
Application number: CN202310567683.8A
Authority: CN
Inventors: 高强; 樊嘉; 顾建英; 江孙芳; 纪元; 许佳悦
Original assignee: Guangzhou Burning Rock Dx Co ltd; Zhongshan Hospital Fudan University
Current assignee: Guangzhou Burning Rock Dx Co ltd; Zhongshan Hospital Fudan University
Priority date: 2022-12-22
Filing date: 2023-05-19
Publication date: 2024-06-25

Abstract

The invention provides a methylation signal detection method, device and kit, and in particular relates to a biomarker combination for detecting the methylation level of a sample to be tested. The biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in table 1, where the reference genome used for the DMRs in table 1 is the GRCh37/hg19 human reference genome. The invention allows the risk of cancer formation to be evaluated at low cost and with high accuracy.

Description

Methylation signal detection method, device and kit

Technical Field

The invention relates to the technical field of biology, in particular to a methylation signal detection method and a kit.

Background

In 2018, cancer caused a large number of deaths worldwide, and most cases were diagnosed at a late stage. To date, intervention before distant metastasis offers the greatest opportunity to improve prognosis, so it is highly desirable to develop sensitive, reliable and minimally invasive assays that detect cancer before symptoms appear. Among the many cancer types, liver cancer (hepatocellular carcinoma, HCC) seriously endangers human health: it has a high incidence, an insidious onset, rapid progression, and high recurrence and mortality rates, and is known as the "king of cancers". Most liver cancer patients are already at a middle or late stage when they first visit a hospital, and without treatment the natural course of the disease is only 3-6 months. Currently, liver cancer is detected mainly by two means: serum marker detection and imaging examination.

Existing liver cancer serum marker assays include the serum alpha-fetoprotein (AFP) assay and blood enzymology and other tumor marker assays. Of these, the serum AFP assay is relatively specific for diagnosing liver cancer. A persistent serum AFP level of 400 μg/L or more by radioimmunoassay, with pregnancy, active liver disease and the like excluded, supports a diagnosis of liver cancer. However, about 30% of liver cancer patients are AFP-negative in clinical practice, so the assay misses a substantial fraction of cases. Blood enzymology and other tumor marker tests rely on the principle that serum levels of gamma-glutamyl transpeptidase and its isoenzymes, abnormal prothrombin, alkaline phosphatase and lactate dehydrogenase isoenzymes may be higher than normal in liver cancer patients, but these tests also lack specificity.

Imaging examinations typically include ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), selective celiac or hepatic angiography, and liver needle aspiration cytology. However, imaging can only be performed after a tumor has formed and reached a certain size, so it cannot serve the purpose of early cancer detection or early cancer screening.

Currently, DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis and monitoring. Most regions of the human genome are not active during the development of cancer, and cancer-related variations tend to concentrate in certain specific regions, such as CpG islands, which provides a good opportunity for targeted sequencing. Although a vast number of scientific articles report DNA methylation-based biomarkers and their clinical relevance in cancer, only a few tens of biomarkers have been converted into commercial clinical test products, and products directed to a single cancer (e.g., liver cancer) are even scarcer. Meanwhile, the discovery and screening of cancer-related differentially methylated regions (DMRs) is challenging: population heterogeneity, including disease, age and the like, causes non-specific changes in methylation profiles, so signals that are abnormal but non-cancerous must be handled when building a cancer assessment model. There is therefore an urgent need for methods and biomarker combinations that capture cancer DMRs and assess the risk of cancer formation.

Disclosure of Invention

The invention provides a methylation signal detection method, device and kit, which use DNA or RNA oligonucleotide sequences to capture cancer-associated methylation variation regions, determine whether a tumor component (ctDNA) is present in a sample to be tested, and provide a low-cost, high-accuracy way of assessing the correlation with the risk of cancer formation.

In one aspect, the invention provides a biomarker panel for assessing the correlation of a sample to be tested with the risk of cancer formation, wherein the biomarker panel comprises any at least 10 of the differentially methylated regions (DMRs) shown in table 1, and the reference genome used for the DMRs in table 1 is the GRCh37/hg19 human reference genome.

In another aspect, the invention provides a kit comprising reagents for detecting a biomarker combination as described above.

In another aspect, the invention provides the use of a reagent for detecting a biomarker combination as described above in the manufacture of a kit for diagnosing risk of cancer formation.

In another aspect, the invention provides a method of assessing the correlation of a sample to be tested with the risk of cancer formation, comprising: obtaining methylation level data produced by detecting a biomarker combination in the sample to be tested, wherein the biomarker combination comprises the biomarker combination described above; based on the methylation level data, quantifying the bias caused by the confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and, based on the result of comparing the corrected methylation level data with a preset threshold, generating indication information that represents the degree of correlation between the sample to be tested and the risk of cancer formation.

In another aspect, the invention provides an apparatus for assessing the correlation of a sample to be tested with the risk of cancer formation, comprising: an acquisition unit configured to acquire methylation level data produced by detecting a biomarker combination in the sample to be tested, wherein the biomarker combination comprises the biomarker combination described above; a correction unit configured to quantify, based on the methylation level data, the bias caused by the confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and a determining unit configured to generate, based on the result of comparing the corrected methylation level data with a preset threshold, indication information that represents the degree of correlation between the sample to be tested and the risk of cancer formation.

In another aspect, the present invention provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method described above.

In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.

The biomarker combination, the kit, the method, the application, the device, the electronic equipment and the storage medium can be suitable for risk assessment of cancers, and have the advantages of low cost and high accuracy.

Specifically, the cancers include one or more of the following: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax (except lung), melanoma, and testicular cancer.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention. In the drawings:

Fig. 1 shows an exemplary case where CpG sites cannot be divided into the same DMR.

Fig. 2 shows an exemplary case where CpG sites are partitioned into the same DMR.

Fig. 3 illustrates an exemplary case for explaining the principle of judging whether the DMR is valid or not in the present invention.

Fig. 4 shows the control results of the weight configuration of the confounding variables in the DOC model of the present application.

Fig. 5 shows that the DOC model established by the present invention remains balanced across the age groups.

Fig. 6 shows, for five replicates of 10 randomly selected DMRs according to the invention, the distributions for healthy people and for cancer patients as a whole, and the sensitivities for cancer patients at 80% specificity.

Detailed Description

I. Definitions

In the present invention, unless otherwise indicated, scientific and technical terms used herein have the meanings commonly understood by one of ordinary skill in the art. Also, protein and nucleic acid chemistry, molecular biology, cell and tissue culture, microbiology, immunology-related terms and laboratory procedures as used herein are terms and conventional procedures that are widely used in the corresponding arts. Meanwhile, in order to better understand the present invention, definitions and explanations of related terms are provided below.

As used herein, the term "differentially methylated region" (DMR) generally refers to a region of DNA that contains one or more differentially methylated sites. For example, a DMR that includes a greater number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be referred to as a hypermethylated DMR, and a DMR that includes a lesser number or frequency of methylated sites under such a condition may be referred to as a hypomethylated DMR.

As used herein, the term "methylation" generally refers to the methylation state of a gene fragment, nucleotide, or base thereof of the present application. For example, a DNA fragment in which a gene of the application is located may have methylation on one or more strands. For example, a DNA fragment in which a gene of the application resides may have methylation at one site or DMR or at multiple sites or DMR.

As used herein, the term "next generation sequencing" (NGS) refers to any sequencing method that determines the nucleotide sequence of individual nucleic acid molecules (e.g., in single-molecule sequencing) or of clonally amplified surrogates of individual nucleic acid molecules in a high-throughput mode (e.g., sequencing more than 10³, 10⁴, 10⁵ or more molecules simultaneously). Next generation sequencing platforms include, but are not limited to, existing platforms such as those of Illumina. As sequencing technology continues to develop, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. Next generation sequencing methods include, for example, sequencing by synthesis (Illumina), massively parallel signature sequencing (MPSS), polony sequencing, pyrosequencing (454), ion semiconductor sequencing (Ion Torrent), DNA nanoball sequencing, DNA nanoarray and combinatorial probe-anchor ligation sequencing (Complete Genomics), single-molecule real-time sequencing (Pacific Biosciences), sequencing by ligation (SOLiD sequencing), and the like. Because next generation sequencing enables detailed analysis of the transcriptome and genome of a species, it is also referred to as deep sequencing. The methods of the invention are equally applicable to first generation, second generation and third generation gene sequencing, or single molecule sequencing (SMS).

As used herein, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome exists in different versions, for example hg19, hg38, GRCh37, GRCh38, GCA_000001405, GCF_000001405, or Ensembl 75.

As used herein, the terms "polynucleotide," "nucleotide," "nucleic acid," and "oligonucleotide" are used interchangeably. They denote polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogues thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short-hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.

As used herein, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected. In embodiments of the present invention, the sample to be tested includes, but is not limited to, a tissue sample, a blood sample, saliva, sputum, pleural effusion, pulmonary lavage, peritoneal effusion, peritoneal lavage, and cerebrospinal fluid.

As used herein, the term "Youden index", also known as the correct index, is a measure for evaluating the validity of a screening test; it is applicable when false negatives (missed diagnoses) and false positives (misdiagnoses) are considered equally harmful. The Youden index is the sum of sensitivity and specificity minus 1, and it indicates the overall ability of the screening method to identify true patients and non-patients. The larger the index, the better the screening test performs and the greater its validity. The term "optimal Youden index" refers to the case where the sum of sensitivity and specificity minus 1 is largest.
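The definition above can be illustrated with a minimal sketch. The function name and the counts in the usage line are hypothetical and chosen only for illustration; they are not data from this application.

```python
def youden_index(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden index J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)  # share of true patients that are detected
    specificity = tn / (tn + fp)  # share of non-patients that are correctly cleared
    return sensitivity + specificity - 1.0


# Hypothetical counts: 80 of 100 patients detected, 90 of 100 non-patients cleared.
j = youden_index(tp=80, fn=20, tn=90, fp=10)
```

With these illustrative counts, sensitivity is 0.8 and specificity is 0.9, so the index is approximately 0.7. A test that performs no better than chance has an index near 0.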

Detailed description of the preferred embodiments

In one aspect, the invention provides a biomarker panel for assessing the correlation of a sample to be tested with the risk of cancer formation, wherein the biomarker panel comprises any at least 10 of the differentially methylated regions (DMRs) shown in table 1, and the reference genome used for the DMRs in table 1 is the GRCh37/hg19 human reference genome.

In some preferred embodiments, the 10 DMRs are any one set of DMRs shown in table 3.

In some preferred embodiments, the biomarker combinations described above comprise all 100 DMRs shown in table 1.

In another aspect, the invention provides a kit, wherein the kit comprises reagents for detecting the biomarker combination.

In some alternative embodiments, the above-described kits comprise next-generation sequencing reagents.

In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any at least 10 of the DMRs in table 1.

In some more preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover any of the sets of DMRs shown in table 3.

In some more preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover all 100 DMRs shown in table 1.

In some alternative embodiments, the above-described kit is used to assess the correlation of a test sample with the risk of cancer formation.

In some preferred embodiments, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.

In some more preferred embodiments, the cancer is liver cancer.

In another aspect, the invention provides a method of assessing the correlation of a sample to be tested with the risk of cancer formation, comprising: obtaining methylation level data produced by detecting a biomarker panel in the sample to be tested, wherein the biomarker panel comprises the biomarker panel described above; based on the methylation level data, quantifying the bias caused by the confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and, based on the result of comparing the corrected methylation level data with a preset threshold, generating indication information that represents the degree of correlation between the sample to be tested and the risk of cancer formation.

In some preferred embodiments, first indication information representing the degree of correlation with the risk of cancer formation is generated in response to the corrected methylation level data being less than or equal to the preset threshold, and second indication information representing the degree of correlation with the risk of cancer formation is generated in response to the corrected methylation level data being greater than the preset threshold. The indication information may be a prompt indicating whether a risk of cancer formation exists, or indicating different degrees of that risk. For example, the first indication information may be a prompt indicating that a risk of cancer formation exists or that the risk is high, and the second indication information may be a prompt indicating that no risk of cancer formation exists or that the risk is low.
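The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function name and message strings are hypothetical, the corrected methylation level data is assumed to have been reduced to a single score per sample, and the threshold of -1.22 is merely one value from the range given in this description.

```python
def indication(corrected_score: float, threshold: float) -> str:
    """Return the first indication (risk present or high) when the corrected
    score is less than or equal to the threshold, otherwise the second."""
    if corrected_score <= threshold:
        return "first indication: risk of cancer formation present or high"
    return "second indication: no risk of cancer formation, or risk low"


# Illustrative threshold taken from the range stated in this description.
ILLUSTRATIVE_THRESHOLD = -1.22
print(indication(-1.50, ILLUSTRATIVE_THRESHOLD))
print(indication(-0.80, ILLUSTRATIVE_THRESHOLD))
```

Note the direction of the comparison: in this scheme a lower corrected score, not a higher one, maps to the higher-risk indication.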

In some preferred embodiments, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index in each fold, obtained by ten-fold cross-validation on the training set samples.
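One way to read this threshold selection is sketched below. It is not the applicant's implementation: the function names are hypothetical, each sample is assumed to carry a single corrected score, a sample is called positive when its score is less than or equal to the cut-off, and the per-fold threshold is assumed to be chosen on the nine-tenths of the training set that remain when one fold is left out.

```python
import numpy as np


def youden_optimal_threshold(scores, labels):
    """Cut-off t maximising sensitivity + specificity - 1, calling a sample
    positive when score <= t (labels: 1 = cancer, 0 = non-cancer)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):  # candidate cut-offs are the observed scores
        called = scores <= t
        sensitivity = np.mean(called[labels == 1])
        specificity = np.mean(~called[labels == 0])
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_t, best_j = float(t), j
    return best_t


def mean_cv_threshold(scores, labels, n_folds=10, seed=0):
    """Mean of the Youden-optimal cut-offs over n_folds leave-one-fold-out splits."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.random.default_rng(seed).permutation(len(scores))
    folds = np.array_split(order, n_folds)
    thresholds = []
    for k in range(n_folds):
        # Assumption: the cut-off is chosen on the samples outside fold k.
        kept = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        thresholds.append(youden_optimal_threshold(scores[kept], labels[kept]))
    return float(np.mean(thresholds))
```

Averaging the per-fold cut-offs, rather than picking a single cut-off on the whole training set, makes the final threshold less sensitive to any one subset of samples. The sketch assumes every split still contains both cancer and non-cancer samples.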

In some preferred embodiments, the preset threshold is taken from the range of -0.4 to -1.65.

In some more preferred embodiments, the preset threshold is taken from the range of -1.22 to -1.65.

In some more preferred embodiments, the cancer is liver cancer.

In some alternative embodiments, the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

In some preferred embodiments, the above-mentioned determining unit is further configured to generate first indication information representing the degree of correlation with the risk of cancer formation in response to the corrected methylation level data being less than or equal to the preset threshold, and to generate second indication information representing the degree of correlation with the risk of cancer formation in response to the corrected methylation level data being greater than the preset threshold. The indication information may be a prompt indicating whether a risk of cancer formation exists, or indicating different degrees of that risk. For example, the first indication information may be a prompt indicating that a risk of cancer formation exists or that the risk is high, and the second indication information may be a prompt indicating that no risk of cancer formation exists or that the risk is low.

In some more preferred embodiments, the cancer is liver cancer.

The implementation environment of the present invention includes an electronic device, and the method for evaluating the correlation between the sample to be tested and the risk of cancer formation in the embodiment of the present invention may be executed by a terminal device. By way of example, the electronic device may comprise at least one of a terminal device or a server.

The terminal device may be hardware or software. When the terminal device is hardware, it may be a variety of electronic devices having a display screen and supporting information input (e.g., text input and/or voice input, etc.), including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like. When the terminal device is software, it can be installed in the above-listed terminal device. It may be implemented as a plurality of software or software modules (e.g. to provide a correlation service for assessing the risk of developing cancer in a sample to be tested) or as a single software or software module. The present invention is not particularly limited herein.

The computer readable medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method of assessing the correlation of a sample under test with risk of cancer formation shown in the above-described embodiments and alternative embodiments thereof.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example technical solutions in which the above features are replaced with technical features having similar functions disclosed in (but not limited to) the present invention.

Examples

Example 1: division of DMR regions

1. Hypothesis testing

Samples to be tested (for example, blood samples) are obtained and divided into a liver cancer group (group C) and a normal group (group N). Bisulfite methylation sequencing of the samples to be tested may comprise the following steps:

S1: cell-free DNA (cfDNA) extraction: for example, the QIAamp Circulating Nucleic Acid Kit (Qiagen, 55114) and its corresponding platform can be used;

S2: bisulfite conversion: for example, the bisulfite conversion (BC) step is performed using a modified protocol based on the EZ-96 DNA Methylation-Lightning™ MagPrep kit (Zymo, D5047);

S3: pre-library preparation: this comprises a first tailing and ligation step, in which a splint (splinter) adapter carrying several randomly synthesized G or A bases may be used: the adapter overhang anneals to the poly-C/T tail at the 3' end of the single-stranded DNA substrate, and ligation is completed after the overhang hybridizes with this first tail; the DNA substrate carrying an adapter at one end is then denatured into single strands and subjected to 5-15 rounds of linear amplification; a second tailing and ligation step, similar to the first, ligates a second adapter to the A tail at the other end of the DNA substrate; and several rounds of PCR amplification complete the preparation of the pre-library (see, for example, Chinese patent publication CN110892097A);

S4: pre-library hybridization: the pre-library is hybridized with hybridization capture probes covering the target DMR regions;

S5: capture and elution: probe-bound fragments are captured on magnetic beads and non-specific fragments are washed away, the magnetic beads are removed, and the final library is obtained by PCR amplification;

S6: sequencing: the final library is sequenced on an NGS sequencer to generate sequencing data containing the target DMR regions.

In this embodiment, the step of noise reduction treatment for genomic methylation signal CpG and noise region CHH/CHG sites may be optionally included, for example, see Chinese patent publication CN114974417A.

For each CpG site, a hypothesis test is performed on whether the difference between group C and group N is statistically significant, and a P value is calculated for each CpG site. The calculation uses weighted logistic regression (weighted LR): the weight is assigned according to the coverage depth of each CpG site, the methylation level of each CpG site is taken as the explanatory variable, and the output is a binary result (0, 1) corresponding to C and N.
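A per-site test of this kind can be sketched as below. Several choices here are assumptions rather than details from this application: the regression is fitted by Newton iterations in NumPy, the coverage depths are rescaled to mean 1 before being used as observation weights, and the P value is taken from a Wald test on the slope. The function name and the sample values are hypothetical.

```python
import math

import numpy as np


def weighted_logistic_pvalue(meth, label, depth, n_iter=50, tol=1e-8):
    """Fit label ~ methylation level for one CpG site, weighting each sample
    by its coverage depth. Returns (slope, two-sided Wald P value)."""
    x = np.asarray(meth, dtype=float)
    y = np.asarray(label, dtype=float)   # 1 = group C, 0 = group N
    w = np.asarray(depth, dtype=float)
    w = w / w.mean()                     # assumption: weights rescaled to mean 1
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):              # Newton-Raphson on the weighted likelihood
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        hess = X.T @ (X * (w * p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (w * p * (1.0 - p))[:, None]))
    z = beta[1] / math.sqrt(cov[1, 1])   # assumption: Wald statistic for the slope
    return float(beta[1]), math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical methylation levels and depths for one CpG site.
meth = [0.80, 0.70, 0.90, 0.60, 0.20, 0.85, 0.10, 0.20, 0.15, 0.70, 0.05, 0.25]
label = [1] * 6 + [0] * 6
depth = [500, 400, 300, 450, 200, 350, 480, 420, 310, 150, 390, 260]
slope, p_value = weighted_logistic_pvalue(meth, label, depth)
```

The sketch assumes the two groups overlap at the site; if they are perfectly separated, the maximum-likelihood fit does not exist and the iterations will not settle. In practice an established statistics package would normally be preferred over a hand-written fit.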

2. Division of DMRs

The similarity of the methylation levels of sites that are contiguous in genomic space is evaluated according to the following formula, taking the methylation level and the sequencing coverage depth of each methylated CpG site as parameters. The deeper the coverage, the larger the value of the parameter P in the formula and the higher the similarity of the methylation levels between adjacent CpG sites within the same group (liver cancer group or normal group). The DMRs are then divided on this basis:

The subscript ij of each parameter represents the j-th site of the i-th sample, the parameter d is used for representing the effective coverage depth of the CpG sites in the liver cancer group, and the parameter M is used for representing the methylation level of the CpG sites in the liver cancer group.

The calculated β value is used as the decision index, with β = 0.25 as the preset threshold. When β is less than the preset threshold, the j-th and (j+1)-th sites can be included in the calculation of the region statistic B (the B value indicates whether a DMR obtained by the division is a valid DMR) and may be divided into one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites cannot be included in the calculation of the region statistic B and are not divided into one DMR.

This embodiment gives an exemplary case in which CpG sites cannot be divided into the same DMR (as shown in Fig. 1), to explain the principle of DMR division in the present invention.

The colored dots represent methylated CpG sites. Sample A, sample B and sample C are from the same sample group (e.g., the tumor group or the normal group described above); sample A and sample B are each covered by 500 effective sequences, and sample C is covered by 200 effective sequences. The dots in each column correspond to the same CpG site. In sample A, the methylation level of the first CpG site in the region is 0.2 and the methylation level of the second CpG site is 0.

For sample A, sample B and sample C above, the coverage depth parameter value P for the first CpG site in the region is calculated to be 0.617. Substituting the above parameters into the above formula gives β₁₁ = 0.29. Against the preset threshold of 0.25, the methylation level difference between the first CpG site and the second CpG site in the region is greater than 0.25, so these two adjacent CpG sites are not divided into the same DMR.

Another exemplary case of dividing into the same DMR is given in this embodiment (as shown in fig. 2) to explain the principle of dividing the DMR in the present invention.

Wherein the colored dots characterize a methylated CpG site, sample A, sample B and sample D are from the same sample group (e.g., tumor group or normal group) and wherein sample A and sample B each obtain coverage of 500 effective sequences and sample D obtains coverage of 400 effective sequences (the coverage depth of sample D is increased compared to sample C in the previous example, and thus the P value in the present example is also increased accordingly). Also, in sample a, the methylation level of the first CpG site in this region is 0.2 and the methylation level of the second CpG site is 0.

For sample A, sample B and sample D above, the coverage depth parameter value P for the first CpG site within the region was calculated to be 0.962. Substituting the above parameters into the above formula gives β₁₁ = 0.21. Because 0.21 is smaller than the preset threshold of 0.25, the first and second CpG sites in the region are divided into the same DMR.

The above method is described in Chinese patent publication CN115132273A.

Therefore, by introducing the coverage depth of CpG sites into the DMR division process, the method can remarkably improve the accuracy of DMR division.
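The division rule described above (adjacent sites stay in the same candidate DMR only while β is below 0.25) can be illustrated with a minimal Python sketch. It assumes the β values between neighbouring sites have already been calculated with the formula given earlier; the function name `segment_by_beta` and the list-based input are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

BETA_THRESHOLD = 0.25  # preset threshold stated in the text


def segment_by_beta(betas: List[float],
                    threshold: float = BETA_THRESHOLD) -> List[Tuple[int, int]]:
    """Group consecutive CpG sites into candidate regions.

    betas[j] is the precomputed beta value between site j and site j+1,
    so k sites give k-1 beta values. Returns inclusive (start, end)
    site indices for each candidate region.
    """
    n_sites = len(betas) + 1
    regions = []
    start = 0
    for j, beta in enumerate(betas):
        if beta >= threshold:
            # Sites j and j+1 are not placed in the same DMR: close the region.
            regions.append((start, j))
            start = j + 1
    regions.append((start, n_sites - 1))
    return regions


# The two worked examples from the text (fig. 1 and fig. 2).
fig1_regions = segment_by_beta([0.29])  # the two sites are split
fig2_regions = segment_by_beta([0.21])  # the two sites are merged
```

The sketch only produces candidate regions; whether a candidate counts as an effective DMR is decided separately by the region statistic B.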

3. Calculation of region statistics B value

In some optional embodiments, based on the above calculated β value, a region statistic B value of CpG sites in the region is further calculated according to the following formula to represent whether the DMR obtained by the division is a valid DMR.

The calculation formula of the value B is as follows:

Wherein the parameter k is the number of CpG sites in the region, and the subscript ij of each parameter denotes the j-th site of the i-th sample. With β = 0.25 as the preset threshold, when β is smaller than the preset threshold, the j-th and (j+1)-th sites can be substituted into the calculation of the region statistic B and may be divided into one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites cannot be substituted into the calculation of the region statistic B and are not divided into one DMR. With B = 1 as the preset threshold, when the B value is smaller than the preset threshold, the DMR corresponding to the j-th and (j+1)-th sites can be used as an effective DMR; when the B value is greater than or equal to the preset threshold, the DMR corresponding to the j-th and (j+1)-th sites is not used as an effective DMR.

An exemplary case (as shown in fig. 3) is given in this embodiment to explain the principle of judging whether the DMR is effective in the present invention.

When the DMRs divided for groups A, B and C each contain 10 CpG sites, the B_ij values of all samples are pooled when calculating the B value of each DMR, and their average is taken as the score of that DMR.

Wherein the calculation steps of the B value in the DMR shown in the group A are shown in the following table:

The B-value score of the DMR corresponding to group A is less than the preset threshold of 1; therefore, this DMR can be used as an effective DMR.

Similarly, the B-value score of the DMR shown in group B is below the preset threshold, so it can be used as an effective DMR; the B-value score of the DMR shown in group C is not below the preset threshold, so the DMR corresponding to group C cannot be an effective DMR.
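The validity check described above can be sketched as follows. This is a minimal Python illustration that assumes the per-sample, per-site values B_ij have already been calculated with the formula of this section; it only shows the pooling, averaging and comparison against the preset threshold of 1. The function names and the example values are hypothetical.

```python
from statistics import mean
from typing import Sequence

B_THRESHOLD = 1.0  # preset threshold stated in the text


def dmr_b_score(b_values: Sequence[Sequence[float]]) -> float:
    """Pool B_ij over all samples i and sites j and return the average."""
    pooled = [b for sample in b_values for b in sample]
    return mean(pooled)


def is_effective_dmr(b_values: Sequence[Sequence[float]],
                     threshold: float = B_THRESHOLD) -> bool:
    """A DMR is effective only when its score is strictly below the threshold."""
    return dmr_b_score(b_values) < threshold


# Hypothetical values: two samples, two sites each.
low_scores = [[0.2, 0.4], [0.6, 0.8]]   # average 0.5, effective
high_scores = [[1.5, 0.9], [1.3, 1.1]]  # average 1.2, not effective
```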

Example 2: cancer detection (Detection of Cancer, DOC) model building

For confounding variables that may affect the accuracy of the classification model, the invention quantifies the bias they cause, thereby improving the accuracy and generalization ability of the DOC model. In the application scenario of the present invention, the ctDNA content in a patient's blood differs greatly between stages of liver cancer development and is easily affected by experimental batch effects, and methylation is related to the age and race of the sample source and to whether other diseases are present; all of these conditions may constitute confounding variables in the present embodiment.

The parameters involved in the formulas shown in this embodiment are defined in accordance with the definitions known in the art, except for the parameters specifically defined and explained.

In order to quantify the bias caused by confounding variables, the invention adopts a Salmon model construction method; an exemplary quantization approach in this embodiment is the Hilbert-Schmidt independence criterion (HSIC). The model after bias quantization is corrected by embedding a regularization term.

The quantization using the Hilbert-Schmidt independence criterion is shown in the following formula:

$$\|C_{h(x)h(z)}\|^2 = \left(E_{h(x)h(z)} - E_{h(x)}E_{h(z)}\right)^2 = \left(E_{h(x)h(z)}\right)^2 + \left(E_{h(x)}E_{h(z)}\right)^2 - 2E_{h(x)h(z)}E_{h(x)}E_{h(z)}$$

Wherein L_H (the Hilbert-Schmidt independence criterion) calculated by the formula characterizes the degree of independence of the variables X and Z. In the invention, the following are set: a feature vector X = (x₁, …, x_m), where x_i is an n-dimensional vector representing the methylation features of sample i; a classification label Y = (y₁, …, y_m), where y_i is the classification label of x_i, y_i ∈ {-1, +1}, positive when y_i is +1 and negative when y_i is -1; and a confounding variable Z = (z₁, …, z_m), where z_i is the confounding variable of sample i and m is the number of samples.
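As an illustration of the squared-covariance expression in the formula above, the following Python sketch estimates the expectations with plain sample means over m samples and checks that the compact and expanded forms agree. It assumes h(x) and h(z) are already available as scalar values per sample; the patent does not specify the estimator or the kernels, so the function names and the use of simple means are assumptions.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def squared_covariance(hx: Sequence[float], hz: Sequence[float]) -> float:
    """(E[h(x)h(z)] - E[h(x)]E[h(z)])^2, with expectations as sample means."""
    if len(hx) != len(hz) or not hx:
        raise ValueError("hx and hz must be non-empty and of equal length")
    e_xz = _mean([a * b for a, b in zip(hx, hz)])
    return (e_xz - _mean(hx) * _mean(hz)) ** 2


def squared_covariance_expanded(hx: Sequence[float], hz: Sequence[float]) -> float:
    """Same quantity written as a^2 + b^2 - 2ab, as in the expanded formula."""
    a = _mean([p * q for p, q in zip(hx, hz)])  # E[h(x)h(z)]
    b = _mean(hx) * _mean(hz)                   # E[h(x)]E[h(z)]
    return a ** 2 + b ** 2 - 2 * a * b


# Hypothetical data: a score that tracks the confounder, and one that does not.
dependent = squared_covariance([1, 2, 3, 4], [2, 4, 6, 8])      # 6.25
uncorrelated = squared_covariance([1, -1, 1, -1], [1, 1, -1, -1])  # 0.0
```

A value near zero indicates that the classifier output carries little linear dependence on the confounding variable, which is the property the regularization term is meant to encourage.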

A support vector machine (SVM) is adopted as the main classifier for binary classification; at the same time, in order to control the confounding variables, a regularization term is added to the objective solved by the SVM, the objective being:

$$\text{s.t.}\quad y_i\left(w^{T}x_i + b\right) \ge 1 - \xi_i$$

$$\xi_i \ge 0$$

Here ξ_i denotes the degree to which sample x_i violates the constraint, and C and λ are coefficients that control the balance among minimizing the training error, minimizing the correlation between the confounding variables and the explanatory variables, and maximizing the classification margin.
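The role of the slack variable can be made concrete with a short Python sketch. For a fixed w and b, the smallest ξ_i that satisfies both constraints above is max(0, 1 - y_i(wᵀx_i + b)); the sketch computes exactly that and nothing more. The objective itself is not reproduced here, and the function name and example numbers are hypothetical.

```python
from typing import Sequence


def slack(w: Sequence[float], b: float, x: Sequence[float], y: int) -> float:
    """Smallest xi_i satisfying y_i(w.x_i + b) >= 1 - xi_i and xi_i >= 0."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)


w, b = [1.0, 0.0], 0.0
outside_margin = slack(w, b, [2.0, 0.0], +1)  # 0.0: constraint met without slack
inside_margin = slack(w, b, [0.5, 0.0], +1)   # 0.5: correct side, inside the margin
misclassified = slack(w, b, [1.0, 0.0], -1)   # 2.0: wrong side of the boundary
```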

In this embodiment, fig. 4 shows the control result of the weight configuration of the DOC model of the present application for the confounding variables.

Wherein each data point represents a blood sample used for DOC model construction, the horizontal axis represents the confounding variable of the corresponding sample, and the vertical axis represents the original uncorrected explanatory variable (left graph) and the corrected explanatory variable (right graph), respectively. Comparing the results before and after correction shows that the weight of the confounding variable is controlled in the DOC model established by the invention.

In this example, fig. 5 shows that the DOC model established in the present invention overcomes the weakness of earlier methylation detection, in which false positives in healthy groups increase with age, and remains balanced across age groups (the horizontal axis represents age, and the vertical axis represents the liver cancer probability score of the model).

Example 3: detection of liver cancer based on DMR by DOC model

Blood samples from 100 healthy people and 30 liver cancer patients were used as the training set. Based on the differences between healthy people and cancer patients in different DMR regions, 100 DMRs with significant differences were screened out for constructing the DOC model and determining the threshold. Table 1 below shows the 100 DMRs of the present invention used for DOC model detection.

TABLE 1. 100 DMRs screened by the methylation detection model of the present invention

Ten-fold cross-validation was carried out on the training set, and the mean of the thresholds corresponding to the optimal Youden index of each fold was taken as the threshold for classifying test samples as liver cancer positive or negative. Specifically, the healthy samples in the training set were first randomly split into 10 parts, and the liver cancer samples were likewise randomly split into 10 parts. A DOC model was then established using 9/10 of the healthy samples and 9/10 of the liver cancer samples to predict the remaining 1/10 of the healthy samples and 1/10 of the liver cancer samples. At this point, the "optimal threshold" of this fold can be obtained by the principle of maximizing the Youden index. The loop is repeated until all samples have been traversed; because the cross-validation is ten-fold, 10 "optimal thresholds" are obtained. Finally, the mean of the 10 "optimal thresholds" is calculated as the threshold of the DOC model (for the DOC model of 100 DMRs, threshold = -0.4). The model and its threshold can then be used to classify the test set samples as positive or negative: a sample whose model score is smaller than the threshold is regarded as negative, and a sample whose model score is larger than the threshold is regarded as positive.

According to the DOC model and threshold described above, a test set consisting of another 100 healthy person samples and 82 liver cancer samples was evaluated. The overall sensitivity on the test set was 86.6% (71/82) and the overall specificity was 90.0% (90/100); the sensitivity at each stage is shown in Table 2:

TABLE 2. Sensitivity at each liver cancer stage
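The threshold-selection procedure of this example (a per-fold optimal threshold by the Youden index, then the mean over ten folds) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the DOC model itself is outside the sketch, so each fold is represented only by its held-out scores and labels (+1 liver cancer, -1 healthy); candidate thresholds are taken as midpoints between neighbouring observed scores; and all function names are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def split_into_folds(indices: Sequence[int], k: int,
                     rng: random.Random) -> List[List[int]]:
    """Randomly split one class (healthy or cancer) into k parts."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold maximising J = sensitivity + specificity - 1.

    A sample is called positive when its score is larger than the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("each fold needs both positive and negative samples")
    cuts = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])] or cuts

    def youden_j(t: float) -> float:
        sensitivity = sum(s > t for s in pos) / len(pos)
        specificity = sum(s <= t for s in neg) / len(neg)
        return sensitivity + specificity - 1

    return max(candidates, key=youden_j)


def cross_validated_threshold(
        fold_results: Sequence[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Mean of the per-fold optimal thresholds."""
    return mean([youden_threshold(s, y) for s, y in fold_results])


# Hypothetical held-out scores for two folds (a real run would have ten).
folds = [([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1]),
         ([-3.0, -1.0, 0.0, 2.0], [-1, -1, 1, 1])]
threshold = cross_validated_threshold(folds)  # (0.0 + -0.5) / 2 = -0.25
```

Splitting the healthy and liver cancer samples separately, as `split_into_folds` is meant to be used, keeps both classes present in every fold, which the Youden index requires.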

The threshold determination method is not limited to the specific 100 DMRs; taking factors such as cost into account, it may also be applied to a DOC model built from fewer DMRs. For example, 10 DMRs may be randomly selected from the 100 DMRs each time to construct a new DOC model and determine its threshold, see table 3:

TABLE 3. Five randomly selected sets of 10 DMRs and their threshold information

The sensitivity and specificity results of these five random repeated tests are shown in table 4 (model parameters were adjusted so that the specificity in each random round was controlled at the same level, i.e., 80%):

TABLE 4. Sensitivity and specificity in the repeated tests of 10 random DMRs
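The random repeat design can be sketched in Python as follows. The text states only that model parameters were adjusted to hold specificity at 80%; how this was done is not specified, so choosing the decision threshold from the healthy-sample scores, as done here, is an assumption made purely for illustration. The function names and the placeholder DMR identifiers are hypothetical.

```python
import math
import random
from typing import List, Sequence


def random_dmr_panels(dmr_ids: Sequence[str], panel_size: int = 10,
                      n_repeats: int = 5, seed: int = 0) -> List[List[str]]:
    """Draw n_repeats random panels of panel_size DMRs without replacement."""
    rng = random.Random(seed)
    return [rng.sample(list(dmr_ids), panel_size) for _ in range(n_repeats)]


def threshold_for_specificity(healthy_scores: Sequence[float],
                              target: float = 0.80) -> float:
    """Smallest observed healthy score t with specificity >= target.

    Specificity is the fraction of healthy scores <= t, because a sample
    is called positive only when its score is larger than the threshold.
    """
    if not 0 < target <= 1 or not healthy_scores:
        raise ValueError("target must be in (0, 1] and scores non-empty")
    ordered = sorted(healthy_scores)
    # Number of healthy samples that must be called negative.
    k = math.ceil(round(target * len(ordered), 9))
    return ordered[k - 1]


# Placeholder identifiers standing in for the 100 DMRs of table 1.
panels = random_dmr_panels([f"DMR_{i}" for i in range(1, 101)])
t80 = threshold_for_specificity([float(s) for s in range(10)])  # 7.0
```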

The sensitivity (including the sensitivity at each stage) and specificity results for the healthy group and the liver cancer group in the above five repeated tests are shown in table 5:

TABLE 5. Sensitivity and specificity results for each stage in the repeated tests of 10 random DMRs

From this, it can be seen that any 10 of the 100 DMRs provided by the invention can achieve good specificity and sensitivity at each stage of liver cancer, meeting the expected use.

Furthermore, although the description only provides a DOC model for methylation detection and a threshold determination method built on the differences between liver cancer patients and healthy persons, in practice the method is equally applicable to patients with other cancers, including lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.

The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently illustrated embodiments of the application will be apparent to those of ordinary skill in the art and are intended to be within the scope of the appended claims and equivalents thereof.

Claims

1. A biomarker combination for detecting the methylation level of a test sample, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in table 1, wherein the reference genome employed by the DMRs in table 1 is the GRCh37/hg19 human reference genome;

preferably, the 10 DMRs are any one set of DMRs shown in table 3;

preferably, the biomarker combination comprises all 100 DMRs shown in table 1.

2. A kit, wherein the kit comprises reagents for detecting the biomarker combination of claim 1.

3. The kit of claim 2, wherein the kit comprises next generation sequencing reagents;

Preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any at least 10 DMR in table 1;

more preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of the sets of DMRs shown in table 3;

More preferably, the next generation sequencing reagents comprise hybridization capture probes or primers that cover all 100 DMRs shown in table 1.

4. A kit according to claim 2 or 3, wherein the kit is for assessing the correlation of a test sample with the risk of formation of cancer;

Preferably, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.

5. Use of a reagent for detecting the biomarker combination according to claim 1 in the manufacture of a kit for diagnosing the risk of cancer formation.

6. A method of detecting the methylation level of a test sample, comprising:

obtaining methylation level data obtained by detecting a biomarker combination in a sample to be tested, wherein the biomarker combination comprises the biomarker combination of claim 1;

Based on the methylation level data, carrying out quantization treatment on bias caused by confusion variables corresponding to the sample to be detected to obtain corrected methylation level data;

Generating indication information for representing the degree of correlation between the sample to be tested and the cancer formation risk based on the comparison result of the corrected methylation level data and a preset threshold value; preferably, in response to the corrected methylation level data being less than or equal to the preset threshold, first indication information for indicating a degree of cancer formation risk correlation is generated, and in response to the corrected methylation level data being greater than the preset threshold, second indication information for indicating a degree of cancer formation risk correlation is generated; preferably, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index of each fold, obtained through ten-fold cross-validation on the training set samples; preferably, the preset threshold is -0.4 to -1.65; further preferably, the preset threshold is -1.22 to -1.65.

7. The method of claim 6, wherein the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.

8. An apparatus for detecting a correlation of a test sample with a risk of cancer formation, comprising:

An acquisition unit configured to acquire methylation level data obtained by detecting a biomarker combination in a sample to be tested, wherein the biomarker combination comprises the biomarker combination according to claim 1;

The correction unit is configured to quantize bias caused by confounding variables corresponding to the sample to be detected based on the methylation level data to obtain corrected methylation level data;

A determining unit configured to generate, based on a result of comparison of the corrected methylation level data and a preset threshold value, indication information for characterizing a degree of correlation of the sample to be tested with a risk of cancer formation; preferably, the determining unit is further configured to generate first indication information for indicating a degree of cancer formation risk correlation in response to the corrected methylation level data being less than or equal to the preset threshold value, and to generate second indication information for indicating a degree of cancer formation risk correlation in response to the corrected methylation level data being greater than the preset threshold value;

preferably, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index of each fold, obtained through ten-fold cross-validation on the training set samples.

9. An electronic device, comprising:

one or more processors;

A storage device having one or more programs stored thereon,

The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 6 or 7.

10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by one or more processors implements the method of claim 6 or 7.