CN111910004B

CN111910004B - Application of cfDNA in noninvasive diagnosis of early breast cancer

Info

Publication number: CN111910004B
Application number: CN202010817342.8A
Authority: CN
Inventors: 苏建忠; 刘嘉琦; 赵恒强; 许守平; 吴南; 黄宇宽
Original assignee: HEILONGJIANG PROV TUMOUR HOSPI; Wenzhou Research Institute Of Guoke Wenzhou Institute Of Biomaterials And Engineering; Peking Union Medical College Hospital Chinese Academy of Medical Sciences; Cancer Hospital and Institute of CAMS and PUMC
Current assignee: HEILONGJIANG PROV TUMOUR HOSPI; Wenzhou Research Institute Of Guoke Wenzhou Institute Of Biomaterials And Engineering; Peking Union Medical College Hospital Chinese Academy of Medical Sciences; Cancer Hospital and Institute of CAMS and PUMC
Priority date: 2020-08-14
Filing date: 2020-08-14
Publication date: 2023-09-12
Anticipated expiration: 2040-08-14
Also published as: CN111910004A

Abstract

The invention discloses application of cfDNA in noninvasive diagnosis of early breast cancer, and discloses a cfDNA marker with higher sensitivity and specificity for diagnosing breast cancer and a product prepared based on the marker for diagnosing breast cancer; the invention also discloses a method for screening cancer markers.

Description

Application of cfDNA in noninvasive diagnosis of early breast cancer

Technical Field

The invention belongs to the field of biological medicine, and relates to application of cfDNA in noninvasive diagnosis of early breast cancer.

Background

Breast cancer is the most common cancer in women and the second leading cause of cancer death in women, after lung cancer. Recent statistics from the National Cancer Center show that new breast cancer cases in China occur mainly in women aged 30-59 years, and that breast cancer is the leading cause of cancer death in women under 45 years old (Chen W, Zheng R, Baade P D, et al. Cancer statistics in China, 2015 [J]. CA Cancer J Clin, 2016, 66(2): 115-132.).

The currently accepted gold standard for breast cancer detection is the triple assessment: 1) clinical examination, 2) mammography and/or ultrasound examination, and 3) fine-needle aspiration cytology (Salod Z, Singh Y. Comparison of the performance of machine learning algorithms in breast cancer screening and detection: A protocol [J]. J Public Health Res, 2019, 8(3): 1677.). Mammography used for breast cancer screening has been shown to reduce breast cancer mortality. The Breast Imaging Reporting and Data System (BI-RADS) is a guideline for standardized mammography reporting and management recommendations, formulated by the American College of Radiology (ACR) to standardize the terms and results of mammographic evaluation. Although BI-RADS makes the diagnosis of breast diseases more standardized and objective, poor inter-observer consistency and a high false-positive biopsy rate remain common problems in clinical diagnosis (Hu Yue, Yang Yaping, Gu Ran, et al. Age effects on the PPV of BI-RADS category 3-5 lesions in diagnostic breast ultrasound [J]. Lingnan Modern Clinical Surgery, 2018, 18(6): 644-647; Gupta K, Kumaresan M, Venkatesan B, et al. Sonographic features of invasive ductal breast carcinomas predictive of malignancy grade [J]. Indian J Radiol Imaging, 2018, 28(1): 123-131.), and its low specificity has long troubled clinicians. In the BI-RADS standard specified by the ACR, category 3 lesions are probably benign, category 5 is highly suggestive of malignancy with a likelihood of over 95%, and category 6 corresponds to confirmed malignancy. The most uncertain lesions are BI-RADS category 4, whose range of malignancy risk is too wide. The fifth edition of BI-RADS (2013) divides category 4 lesions into three subcategories: 4a, with a malignancy probability of 2-10%; 4b, with a malignancy probability of 10-50%; and 4c, with a malignancy probability of 50-95%. Other feasible and convenient examination methods are therefore urgently needed in the clinic to diagnose early breast cancer.

In recent years, with the rise of genetic research, cfDNA detection has become a research hot spot at home and abroad (Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond [J]. Nat Rev Genet, 2012, 13(7): 484-492.). Many domestic studies have confirmed that the cfDNA level of tumor patients is higher than that of healthy people. These studies have focused mainly on the gene level, such as KRAS gene mutation, microsatellite alteration, abnormal methylation, and mitochondrial DNA mutation, all of which are consistent with the changes found in malignant tumor cells. It has been reported that cfDNA disseminates to distant sites of the body through the peripheral blood circulation and is one of the important routes by which tumor micrometastases arise in patients. As a new tumor biomarker, peripheral blood cfDNA may have higher sensitivity and specificity. cfDNA detection requires only peripheral venous blood, is relatively less traumatic, and permits repeated sampling. Studying the correlation between cfDNA and breast cancer is therefore of important clinical significance for achieving noninvasive early diagnosis of breast cancer.

Disclosure of Invention

In order to make up for the defects of the prior art, the invention aims to provide a method and a product for noninvasive diagnosis of early breast cancer.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, the present invention provides a marker for predicting early breast cancer, the marker being one or more differentially methylated regions selected from chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143.

In a second aspect the invention provides a reagent for detecting early stage breast cancer, the reagent being capable of detecting the methylation status of a marker according to the first aspect of the invention in a sample.

Further, the reagent detects the methylation status of the differentially methylated region using pyrosequencing, bisulfite sequencing, quantitative and/or qualitative methylation-specific polymerase chain reaction, quantitative and/or qualitative bisulfite-specific polymerase chain reaction, digital polymerase chain reaction, targeted sequencing combined with bisulfite conversion, Southern blotting, restriction landmark genomic scanning, single nucleotide primer extension (SNuPE), CpG island microarray, combined bisulfite restriction analysis, or mass spectrometry.

Further, the reagent comprises:

a primer set capable of amplifying the methylation region; or

a probe capable of hybridizing to the methylation region; or

a methylation-specific binding protein capable of binding the methylation region; or

a methylation-sensitive restriction enzyme; or

a sequencing primer.

Further, the sample is a blood or plasma sample.

In a third aspect the invention provides a kit or chip for diagnosing early breast cancer comprising the reagent according to the second aspect of the invention.

In a fourth aspect, the present invention provides a method of screening for a cancer marker comprising the steps of:

1) Obtaining tissue samples and blood samples of healthy people and patients;

2) Performing whole genome methylation sequencing of the tissue sample and cfDNA methylation sequencing of the blood sample;

3) Determining the genome-wide methylation state of the sample, and analyzing methylation information of cfDNA;

4) Screening methylation regions with significant differences in healthy people and patients as markers;

Further, step 4) includes:

a. screening of tissue differential methylation regions;

b. cfDNA enrichment analysis;

c. cfDNA fragment selection;

d. calculating the cfDNA malignancy ratio;

e. performing feature selection using a random forest and determining the final markers.

Further, the differential methylation region in a is a hypomethylation region.

Further, the differential methylation region comprises at least 5 CpG sites.

Further, the p-value of the t-test is <0.001.

Further, the length of the differential methylation region is >500bp.

Further, absolute DNA methylation difference level >0.2.

Further, the selection condition for cfDNA in c is a fragment length of less than 160 bp.

Further, the formula of the malignancy ratio in d is: where yi represents the class label of the i-th fragment in the given region and N represents the total number of cfDNA fragments tested in the given region.
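Steps c and d can be illustrated with a minimal sketch. The formula itself is not reproduced in this text, so the sketch assumes that the class label yi is 1 for a fragment inferred to be of malignant origin and 0 otherwise, and that the malignancy ratio is the mean of these labels over the N fragments in a region. The `Fragment` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_FRAGMENT_LENGTH = 160  # bp; step c keeps fragments shorter than this


@dataclass
class Fragment:
    length: int  # fragment length in bp
    label: int   # ASSUMED encoding: 1 = malignant origin, 0 = otherwise


def select_short_fragments(fragments: List[Fragment]) -> List[Fragment]:
    """Step c: keep only fragments with length < 160 bp."""
    return [f for f in fragments if f.length < MAX_FRAGMENT_LENGTH]


def malignancy_ratio(fragments: List[Fragment]) -> float:
    """Step d (assumed form): sum of class labels divided by N."""
    n = len(fragments)
    if n == 0:
        return 0.0  # assumed convention for a region with no fragments
    return sum(f.label for f in fragments) / n
```

Under these assumptions, a region with three retained fragments, two of which are labeled malignant, has a malignancy ratio of 2/3.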

A fifth aspect of the present invention provides a scoring apparatus for diagnosing early breast cancer, comprising the following units:

a detection unit: detecting the methylation status of a marker according to the first aspect of the invention in a sample;

an analysis unit: taking the detected methylation status of the marker as an input variable and inputting it into a model for predicting breast cancer risk for analysis;

an evaluation unit: outputting the breast cancer risk value of the individual corresponding to the sample.

Further, the diagnostic model is determined using one or more algorithms selected from the group consisting of: principal component analysis, logistic regression analysis, nearest neighbor analysis, support vector machine, neural network model, random forest.

Further, random forests are used for determination.

Further, the diagnostic model has a cutoff value of 0.5, and when the score is higher than the cutoff value of 0.5, it is determined that the individual is at risk of having breast cancer, and further pathological tissue detection is required.
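The analysis and evaluation units can be sketched as a scoring step followed by the 0.5 cutoff. This is a sketch under stated assumptions: the trained model is passed in as a callable, `evaluate_sample` is a hypothetical name, and `toy_model` is a placeholder that merely averages its inputs and stands in for the trained random forest, which is not specified here.

```python
from typing import Callable, Dict, Sequence, Union

CUTOFF = 0.5  # cutoff value stated in the text


def evaluate_sample(
    marker_values: Sequence[float],
    model: Callable[[Sequence[float]], float],
    cutoff: float = CUTOFF,
) -> Dict[str, Union[float, bool]]:
    """Score the sample with the model, then apply the cutoff."""
    score = model(marker_values)
    # A score strictly higher than the cutoff flags the individual for
    # further pathological tissue detection.
    return {"risk_score": score, "needs_pathology": score > cutoff}


def toy_model(values: Sequence[float]) -> float:
    """Placeholder only: averages the inputs. NOT the trained model."""
    return sum(values) / len(values)
```

For example, `evaluate_sample([0.9, 0.7, 0.8], toy_model)` flags the sample, while a score of exactly 0.5 does not, because the text requires the score to be higher than the cutoff.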

A sixth aspect of the invention provides the use of any one of the following:

1) The application of the marker in the first aspect of the invention in preparing a product for diagnosing early breast cancer;

2) The application of the marker in constructing a diagnosis model of early breast cancer;

3) The use of an agent according to the second aspect of the invention for the preparation of a product for diagnosing early breast cancer;

4) The kit or the chip of the third aspect of the invention is applied to the preparation of products for diagnosing early breast cancer;

5) The use of the method of the fourth aspect of the invention for screening for a cancer marker;

6) The method according to the fourth aspect of the invention is used for constructing a diagnostic model of early breast cancer.

Further, the diagnosis in 1), 3), 4) is a non-invasive diagnosis.

A seventh aspect of the invention discloses a diagnostic model of early breast cancer scoring the markers of the first aspect of the invention using one or more of the following algorithms: principal component analysis, logistic regression analysis, nearest neighbor analysis, support vector machine, neural network model, random forest.

Further, the diagnostic model scores the markers of the first aspect of the invention using a random forest.

Further, the diagnostic model has a cutoff value of 0.5, and when the score is above the cutoff value of 0.5, the subject is indicated to have breast cancer.

The invention has the advantages and beneficial effects that:

Through genome-wide cfDNA methylation sequencing, the invention discovers for the first time hypomethylated cfDNA regions related to breast cancer, namely chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143. By detecting the methylation state of these differentially methylated regions, early diagnosis of breast cancer can be achieved. The product or means for diagnosing breast cancer disclosed by the invention has high sensitivity, specificity and accuracy, and is noninvasive.

The invention relates to a plasma cell-free DNA methylation detection kit based on hypomethylated regions, which achieves a high degree of ctDNA enrichment and more reliable results.

Drawings

FIG. 1 is a graph of cfDNA in breast cancer patients and patients with benign breast lesions; wherein, panel A is a graph of average cfDNA concentration; panel B is a fragment size distribution of cfDNA in patients with benign breast lesions; panel C is a fragment size distribution of cfDNA in breast cancer patients.

FIG. 2 is a cfDNA methylation profile; wherein, panel A is an average coverage depth map of cfDNA fragments (TSS, transcription start site; TES, transcription end site); panel B is an average coverage depth map of cfDNA fragments in different genomic regions; panel C is a CpG density map of different differentially methylated regions; panel D is a plot of the proportion of DMRs for different numbers of cfDNA fragments.

FIG. 3 is a graph of the efficacy of cfDNA methylation in the diagnosis of breast cancer; wherein, panel A is a plot of the malignancy ratio of the 10 best cfDNA hypomethylated region markers (ns: not significant; * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; **** p ≤ 0.0001; B represents benign, M represents malignant); panel B is a ROC plot of the cfDNA methylation model in the training set; panel C is a ROC plot of the cfDNA methylation model in the validation set; panel D is a ROC plot of the joint diagnostic model combining cfDNA with imaging examination in the training set; panel E is a ROC plot of the joint diagnostic model combining cfDNA with imaging examination in the validation set.

FIG. 4 is a ROC graph of conventional diagnostic methods; wherein, panel A is molybdenum target X-ray (mammography) diagnosis, panel B is ultrasound diagnosis, panel C is CEA diagnosis, and panel D is CA15-3 diagnosis; the gray area represents the 95% confidence interval; ns: not significant; * p ≤ 0.05; ** p ≤ 0.01; *** p ≤ 0.001; **** p ≤ 0.0001.

FIG. 5 is a graph of cfMeth scores in different clinical categories; wherein, panel A is a distribution of cfMeth scores in each imaging BI-RADS category; panel B is a distribution of cfMeth scores across different clinical features.

FIG. 6 is a graph of the correlation of the cfMeth score with Ki67, tumor size, estrogen receptor (ER), and progesterone receptor (PR); wherein, panel A is the correlation of the cfMeth score with Ki67; panel B is the correlation of the cfMeth score with tumor size; panel C is the correlation of the cfMeth score with ER; panel D is the correlation of the cfMeth score with PR.

Detailed Description

The present inventors established a unified, standard method for aiding the diagnosis of early breast cancer based on changes in the methylation status of specific regions, and screened specific methylation regions associated with breast cancer and its early stage, thereby completing the present invention. Through extensive and intensive research, the invention first screens Differential Methylation Regions (DMRs) by detecting the methylation state of the genome in tissues and determines that the differential methylation regions are hypomethylated regions, then performs cfDNA enrichment analysis and calculates cfDNA enrichment scores, further selects cfDNA with a fragment length of less than 160 bp by fragment size, analyzes the source (malignant or benign) of each fragment, and finally optimizes the features with a random forest classifier to obtain the marker of the invention.

In the present invention, "diagnosis" and "risk of developing" have meanings well known in the art, for example, "diagnosis" is to judge whether or not the disease is developed, and "risk of developing" is to evaluate the size of risk of developing a disease and the size of risk of recurrence after treatment.

According to an alternative embodiment of the present invention, there is provided the use of a reagent for detecting the methylation status of cfDNA differential methylation regions in a test sample of a subject in the preparation of a product for diagnosing early breast cancer, said methylation regions being selected from chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143. In a specific embodiment, the malignancy ratio of the methylation region is significantly higher in breast cancer patients than in patients with benign breast lesions, so that the two groups of patients can be effectively distinguished.

As a preferred embodiment, the methylation region is selected from chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143.

The invention discloses a reagent for detecting early breast cancer and a product containing the reagent, wherein the reagent can detect the methylation state of a marker in a sample. The present invention can utilize any method known in the art to determine the methylation status of a marker. It will be appreciated in the art that the method of determining the methylation status of a marker is not an important aspect of the invention. As alternative embodiments, methods of detecting methylation include, but are not limited to, pyrosequencing, bisulfite sequencing, quantitative and/or qualitative methylation-specific polymerase chain reaction, quantitative and/or qualitative bisulfite-specific polymerase chain reaction, digital polymerase chain reaction, targeted sequencing combined with bisulfite conversion, Southern blotting, restriction landmark genomic scanning, single nucleotide primer extension (SNuPE), CpG island microarray, combined bisulfite restriction analysis, or mass spectrometry.

The term "polymerase chain reaction" refers to a method for amplifying a target sequence, consisting of the following steps: a large excess of two oligonucleotide primers is introduced into a DNA mixture containing the desired target sequence, followed by a precise sequence of thermal cycling in the presence of a DNA polymerase. The two primers are complementary to their respective strands of the double-stranded target sequence. For amplification, the mixture is denatured and the primers are then annealed to their complementary sequences within the target molecule. After annealing, the primers are extended with a polymerase to form a new pair of complementary strands. The steps of denaturation, primer annealing, and polymerase extension can be repeated multiple times (i.e., denaturation, annealing, and extension constitute one "cycle"; there can be many "cycles") to obtain a high concentration of amplified fragments of the desired target sequence. The length of the amplified fragment of the desired target sequence is determined by the relative positions of the primers with respect to each other, and is thus a controllable parameter. Because of the repetitive nature of the method, it is referred to as the "polymerase chain reaction" ("PCR"). Since the desired amplified fragment of the target sequence becomes the predominant sequence (in terms of concentration) in the mixture, it is said to be "PCR amplified" and is referred to as a "PCR product" or an "amplicon".

The term "primer" refers to an oligonucleotide, whether occurring naturally (as in a purified restriction digest) or produced synthetically, that is capable of acting as a point of initiation of synthesis when placed under conditions in which synthesis of a primer extension product complementary to a nucleic acid strand is induced (e.g., in the presence of nucleotides and an inducing agent such as a DNA polymerase, and at a suitable temperature and pH). The primer is preferably single-stranded for maximum efficiency of amplification, but may also be double-stranded. If double-stranded, the primer is first treated to separate its strands before being used to prepare extension products. Preferably, the primer is an oligodeoxyribonucleotide. The primer must be long enough to prime the synthesis of extension products in the presence of the inducing agent. The exact length of the primer will depend on many factors, including temperature, source of primer, and use of the method.

The term "probe" refers to an oligonucleotide (e.g., a nucleotide sequence), whether occurring naturally (as in a purified restriction digest) or produced synthetically, recombinantly, or by PCR amplification, that is capable of hybridizing to another oligonucleotide of interest. Probes may be single-stranded or double-stranded. Probes can be used for the detection, identification, and isolation of specific gene sequences (e.g., a "capture probe"). It is contemplated that in some embodiments, any probe used in the present invention may be labeled with any "reporter molecule" such that it is detectable in any detection system.

As used herein, "methylation" refers to methylation of cytosine at position C5 or N4, methylation of adenine at position N6, or other types of nucleic acid methylation. In vitro amplified DNA is typically unmethylated because in vitro DNA amplification methods generally do not preserve the methylation pattern of the amplified template. However, "unmethylated DNA" or "methylated DNA" may also refer to amplified DNA whose original template was unmethylated or methylated, respectively.

Thus, as used herein, "methylated nucleotide" or "methylated nucleotide base" refers to the presence of a methyl moiety on a nucleotide base, where the methyl moiety is not present in a recognized typical nucleotide base.

Methylation status may optionally be represented or indicated by a "methylation value" (e.g., representing methylation frequency, fraction, proportion, percentage, etc.). Methylation values can be generated, for example, by quantifying the amount of intact nucleic acid present after restriction digestion with a methylation dependent restriction enzyme, or by comparing amplification spectra after a bisulfite reaction, or by comparing the sequences of bisulfite treated and untreated nucleic acid. Thus, a value such as a methylation value represents methylation status and thus can be used as a quantitative indicator of methylation status in multiple copies of a locus. The degree of co-methylation is represented or indicated by the methylation state of more than one methylation site, defined as co-methylation when the methylation state of more than one methylation site is methylated in a single methylation region.

As used herein, the term "bisulfite reagent" refers to a reagent that in some embodiments comprises bisulfite, disulfite, hydrogen sulfite, or a combination thereof. DNA treated with a bisulfite reagent has its unmethylated cytosine nucleotides converted to uracil, while methylated cytosines and other bases remain unchanged, thus allowing discrimination between methylated and unmethylated cytosines in, for example, CpG dinucleotide sequences.

The term "methylation assay" refers to any assay for determining the methylation state of one or more CpG dinucleotide sequences within a nucleic acid sequence.

Methylation data analysis method

In certain embodiments, the measured methylation values of a panel of biomarkers are mathematically combined and the combined value is correlated with the underlying diagnostic question. In some cases, the methylated biomarker values are combined by any suitable existing mathematical method. Well-known mathematical methods for associating biomarker combinations with disease states include Discriminant Analysis (DA) (e.g., linear DA, quadratic DA, regularized DA), Discriminant Function Analysis (DFA), kernel methods (e.g., SVM), multidimensional scaling (MDS), non-parametric methods (e.g., k-nearest neighbor classifiers), PLS (partial least squares), tree-based methods (e.g., logic regression, CART, random forest methods, boosting/bagging methods), generalized linear models (e.g., logistic regression), principal component-based methods (e.g., SIMCA), generalized additive models, fuzzy logic-based methods, and neural network and genetic algorithm-based methods. The skilled artisan will have no difficulty in selecting an appropriate method for assessing the epigenetic markers or biomarker combinations described herein.

In one embodiment, the method used to correlate the methylation status of an epigenetic marker or biomarker combination with, for example, the diagnosis of breast cancer is selected from DA (e.g., linear discriminant analysis, quadratic discriminant analysis, regularized discriminant analysis), DFA, kernel methods (e.g., SVM), MDS, non-parametric methods (e.g., k-nearest neighbor classifiers), PLS (partial least squares), tree-based methods (e.g., logic regression, CART, random forest methods, boosting methods), generalized linear models (e.g., logistic regression), and principal component analysis.

The invention will now be described in further detail with reference to the drawings and examples. The following examples are only illustrative of the present invention and are not intended to limit the scope of the invention. The experimental methods for which specific conditions are not specified in the examples are generally conducted under conventional conditions or under conditions recommended by the manufacturer.

Examples

1. Study object

A total of 210 BI-RADS category 4 lesions found by conventional ultrasound examination at the Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (CHCAMS, n=160, training set) and the Harbin Medical University Cancer Hospital (HMUCH, n=50, validation set) from April 1, 2019 to August 31, 2019 were selected. All patients were female, all underwent molybdenum target X-ray (mammography) examination and ultrasound examination, and all were diagnosed by pathological examination of surgical or needle biopsy specimens. In this study, patients in the training and validation sets were characterized by age, imaging findings, pathological characteristics, molecular subtype, stage, and serum tumor markers such as carcinoembryonic antigen (CEA) and cancer antigen 15-3 (CA15-3). Two experienced radiologists at each center interpreted the images and classified them according to the fifth edition of the BI-RADS classification standard. CEA and CA15-3 analysis was performed at each hospital. Threshold levels of CEA and CA15-3 were set at 5.0 ng/mL and 25.0 U/mL, respectively, and the diagnosis for each patient was based on pathology results from resected specimens. Hormone receptor (HR, including estrogen receptor and progesterone receptor) positive means that more than 1% of tumor cells stain positive for estrogen receptor or progesterone receptor protein. ERBB2/HER2 positive means that ERBB2 protein staining in tumor cells is positive (3+) or that the ERBB2 gene is amplified in tumor cells. Triple negative means that estrogen receptor, progesterone receptor and human epidermal growth factor receptor 2 are all negative. Following the St. Gallen 2017 standard, clinical groupings of molecular subtypes were defined according to the status of hormone receptors and HER2. Following the eighth edition of the American Joint Committee on Cancer (AJCC) breast cancer staging system, cancer stage was defined in terms of the status of the primary tumor (T), lymph nodes (N) and metastasis (M). The clinical information of the patients is shown in Table 1.

Table 1. Clinical information of the patients

2. Sample collection

2.1 tumor sample collection and extraction of genomic DNA

Tumor biopsies were collected from 20 patients who underwent surgery at the Cancer Hospital, Chinese Academy of Medical Sciences, including 10 malignant and 10 benign breast tumors. The histological type of the tumor was determined for each patient from the pathological results of hematoxylin and eosin staining. Genomic DNA was extracted from fresh-frozen tumor tissue using the QIAamp DNA Mini Kit (QIAGEN, Germany).

2.2 blood sample collection and extraction of cfDNA

Preoperative blood samples of all patients were collected and stored in 10 mL cell-free DNA blood collection tubes (Streck, USA) at room temperature, then centrifuged at 1800 g for 10 min to obtain plasma. cfDNA was extracted from the plasma using the QIAamp Circulating Nucleic Acid Kit (Qiagen, USA), following the specific steps detailed in the kit instructions. cfDNA was quantified on a Qubit 3.0 with the dsDNA HS Assay Kit (Life Technologies, USA), and the DNA was stored at -80 ℃.

3. Genomic DNA methylation library preparation

Genomic DNA and unmethylated lambda DNA (Promega, USA) were sonicated into fragments of about 350 bp using a Covaris S220 instrument (Covaris, USA). Genomic DNA (200 ng) was mixed with 0.5% unmethylated lambda DNA, the DNA was bisulfite-converted using the EZ DNA Methylation-Lightning Kit (Zymo Research, USA), and DNA methylation libraries were constructed using the Accel-NGS Methyl-Seq DNA Library Kit and Methyl-Seq Dual Indexing Kit (Swift Biosciences, USA), following the specific procedures detailed in the kit instructions.

4. Construction of cfDNA methylation library

cfDNA and unmethylated lambda DNA (Promega, USA) were sonicated into fragments of about 350 bp using a Covaris S220 instrument (Covaris, USA). cfDNA (200 ng) was mixed with 0.5% unmethylated lambda DNA, the DNA was bisulfite-converted using the EZ DNA Methylation-Lightning Kit (Zymo Research, USA), and DNA methylation libraries were constructed using the Accel-NGS Methyl-Seq DNA Library Kit and Methyl-Seq Dual Indexing Kit (Swift Biosciences, USA), following the specific procedures detailed in the kit instructions.

5. Library quantification and Whole Genome Bisulfite Sequencing (WGBS)

Libraries were quantified using the Qubit dsDNA HS Assay Kit (Life Technologies, USA) and the KAPA Library Quantification Kit (KAPA Biosystems, USA), and library quality was assessed using an Agilent 2100 Bioanalyzer (Agilent, USA). Paired-end 2×150 bp sequencing was performed on the Illumina HiSeq sequencing platform, with a sequencing depth of 30X for genomic DNA and 10X for cfDNA.

6. Quality control, data processing and analysis

Quality Control (QC) analysis was performed using FastQC (version 0.11.8, www.bioinformatics.babraham.ac.uk/projects/fastqc/) to evaluate the read quality of the WGBS data. The raw sequencing reads were processed with trim_galore (version 0.6.0, www.bioinformatics.babraham.ac.uk/projects/trim_galore/) to remove adapter contamination and filter out poor-quality reads. Sequencing reads were mapped using Bismark (version 0.22.1). Variant calling and annotation were performed using PUMP, and all mutated CpG sites were deleted. Bismark was used to identify the cytosine sites in CpG context. For tissue samples, the methylation level of CpG sites was calculated using the Bismark "methylation extractor" command. For plasma samples, the methylation status of individual CpGs in each read was retained for further analysis. Alignment files were processed using the Samtools package (version 1.9); comparison, manipulation, and annotation of genomic features were performed using bedtools (version 2.29.0), with genomic features represented in Browser Extensible Data (BED) files.
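To illustrate the BED representation mentioned above, the following minimal sketch parses the ten marker regions listed in this document and writes them as tab-separated BED lines. The function names are hypothetical, and the sketch assumes the coordinates can be written to BED unchanged; whether the listed coordinates are already zero-based, half-open intervals is not stated in the text.

```python
from typing import List, Tuple

# The ten marker regions as listed in this document.
MARKER_REGIONS: List[str] = [
    "chr1:237343683-237344683", "chr2:3723342-3724342",
    "chr2:3978342-3979342", "chr2:22327459-22328459",
    "chr4:164543184-164544184", "chr6:84666439-84667439",
    "chr8:79343444-79344444", "chr15:26569301-26570301",
    "chr15:33374552-33375552", "chr15:97703143-97704143",
]


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'chrN:start-end' into (chromosome, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def to_bed_line(region: str) -> str:
    """Format a region as a three-column BED line.

    ASSUMPTION: coordinates are copied as-is, with no shift between
    one-based and zero-based conventions.
    """
    chrom, start, end = parse_region(region)
    return f"{chrom}\t{start}\t{end}"


if __name__ == "__main__":
    for region in MARKER_REGIONS:
        print(to_bed_line(region))
```

A file built from these lines could then be passed to interval tools such as bedtools; each listed region spans 1000 bp when the end coordinate minus the start coordinate is taken.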

7. Identification algorithm and machine learning of cfDNA methylation markers

A comprehensive procedure was devised to determine the optimal cfDNA methylation signature to distinguish benign and malignant samples of blood-based WGBS data. It comprises several steps, considering the origin of breast tumor tissue, cfDNA fragment enrichment, fragment size selection, cfDNA malignancy ratio and optimal marker selection.

7.1 identification of differential methylation regions of primary tumor tissue

Differential methylation regions (DMRs) were identified from the WGBS data of 10 benign and 9 malignant breast primary tissue samples using SMART2. The screening conditions for a DMR were: at least 5 CpG sites, t-test p-value < 0.001, length greater than 500 bp, and absolute DNA methylation difference > 0.2.
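The screening conditions above can be sketched as a simple filter. This is only an illustration: the `Region` record and its field names are hypothetical and do not reflect the output format of SMART2.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical candidate region record (not the SMART2 output format)."""
    chrom: str
    start: int
    end: int
    n_cpg: int        # number of CpG sites in the region
    p_value: float    # t-test p-value, benign vs. malignant tissue
    meth_diff: float  # mean methylation difference (malignant - benign)


def passes_dmr_filter(region, min_cpg=5, max_p=0.001, min_len=500, min_abs_diff=0.2):
    """Apply the four screening conditions stated in the text."""
    return (
        region.n_cpg >= min_cpg
        and region.p_value < max_p
        and (region.end - region.start) > min_len
        and abs(region.meth_diff) > min_abs_diff
    )


if __name__ == "__main__":
    candidates = [
        Region("chr1", 1000, 1800, 8, 0.0002, -0.35),  # passes all conditions
        Region("chr2", 5000, 5300, 9, 0.0001, 0.40),   # too short
        Region("chr3", 9000, 9900, 4, 0.0005, 0.30),   # too few CpG sites
    ]
    for r in candidates:
        print(r.chrom, passes_dmr_filter(r))
```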

7.2 cfDNA enrichment analysis

Enrichment analysis was performed on cfDNA using the RefSeq gene annotation from the UCSC Table Browser. Each gene was normalized to 20 kb and the 10 kb flanking region was divided into 40 bins with a 100 bp window. cfDNA enrichment scores were calculated as the average number of fragments in the DMRs. The total read number was normalized to 250 million for each sample. For DNA sequence characterization, the human genome was divided into about 3 million bins of 1 kb in size, and the CpG density, G+C content and cfDNA enrichment score of each bin were calculated. The correlation between the average coverage depth and the CpG density (or G+C content) of each bin was analyzed using linear regression.
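A minimal sketch of the per-bin sequence features and the read-count normalization described above is given below. The exact definitions are assumptions made for illustration: CpG density is taken here as the number of CG dinucleotides per base of the bin, and the regression is an ordinary least-squares fit written out by hand.

```python
def cpg_density(seq):
    """CG dinucleotides per base in a bin (assumed definition)."""
    seq = seq.upper()
    return seq.count("CG") / len(seq)


def gc_content(seq):
    """Fraction of G or C bases in a bin."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def normalize_count(count, total_reads, target=250_000_000):
    """Scale a raw fragment count to a library of 250 million reads."""
    return count * target / total_reads


def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    bin_seq = "ACGTCGAA"  # toy sequence standing in for a 1 kb bin
    print(cpg_density(bin_seq), gc_content(bin_seq))
    print(normalize_count(10, 125_000_000))
    print(linear_fit([0, 1, 2], [1, 3, 5]))
```

A negative slope from `linear_fit` applied to coverage against CpG density would correspond to the inverse relationship reported in the results.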

7.3 selection of fragment size to enhance detection of ctDNA

In the plasma samples of all patients, the peak cfDNA fragment length in the WGBS data was about 167 bp. Studies have shown that ctDNA fragments are shorter than non-tumor cfDNA fragments. Therefore, to reduce the impact of the large amount of non-tumor cfDNA in plasma, cfDNA fragments shorter than 160 bp were selected to increase the ctDNA ratio.

7.4 cfDNA malignancy ratio

Theoretically, even among the shorter cfDNA fragments, the ctDNA content is still low. In traditional methods that detect DMRs using differences in average methylation level, the tumor signal is masked by the high proportion of non-tumor cfDNA in plasma. Based on the distribution of DNA methylation patterns in the tissue DMRs, a fragment-based strategy was designed to statistically infer the origin of each fragment (malignant or not). In this study, the malignant origin of each fragment in the region of interest was identified at single-base resolution using the Diagonal Quadratic Discriminant Analysis (DQDA) method.

DQDA is derived from Bayes' rule:

$$P(y^* = k \mid x^*) \propto f_k(x^*)\,\pi_k \qquad (1)$$

where $y^*$ is the class label of each fragment (k = 0, benign; k = 1, malignant), $x^*$ is the vector of all CpG methylation states in one cfDNA fragment, $f_k$ is the probability density function of $x^*$ in class k, and $\pi_k$ is the prior probability that a fragment comes from class k. Assuming that different groups have different covariance matrices, the decision rule assigns $x^*$ to the class $\hat{y}^*$ with the smallest discrimination score, which is defined as

$$d_k(x^*) = (x^* - \mu_k)^T \Sigma_k^{-1} (x^* - \mu_k) + \ln|\Sigma_k| - 2\ln\pi_k \qquad (2)$$

It should be noted that the population parameters in the score above are unknown and need to be estimated from the sample data. Here, $\hat{\Sigma}_k$ is set to the diagonal matrix $D_k = \mathrm{diag}(s_{k1}^2, \ldots, s_{kp}^2)$ of the sample covariance matrix, and $\bar{x}_k$ and $D_k$ replace $\mu_k$ and $\Sigma_k$ to form DQDA:

$$d_k(x^*) = \sum_{i=1}^{p} \frac{(x_i^* - \bar{x}_{ki})^2}{s_{ki}^2} + \sum_{i=1}^{p} \ln s_{ki}^2 - 2\ln\pi_k \qquad (3)$$

where $x^* = (x_1^*, \ldots, x_p^*)$ is a fragment with p CpG sites, k = 0, 1 denotes benign and malignant samples respectively, $\bar{x}_{ki}$ and $s_{ki}^2$ are the sample mean and sample variance of the i-th CpG site in the k-th group, and $n_k$ is the number of samples in each group. Considering that only a small number of cfDNA fragments should originate from malignant tissue, the prior probability $\pi_k$ of the malignant class was set to 0.1, as evaluated on the training set data. After evaluating the DQDA score between a given fragment and the benign and malignant references using the above formula, $\hat{y}^* = \arg\min_k d_k(x^*)$ was used to infer the source of the fragment. The cfDNA malignancy ratio for a given region in each sample was calculated as follows:

$$\text{malignancy ratio} = \frac{1}{N} \sum_{i=1}^{N} I(y_i^* = 1) \qquad (4)$$

where $y_i^*$ is the class label of the i-th fragment in the given region, and N is the total number of cfDNA fragments tested in the given region.
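The fragment-level classification and the malignancy ratio can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each fragment is represented as a (length, {CpG index: state}) pair, the per-site reference means and variances are toy values, the prior of 0.1 is assigned to the malignant class (0.9 to the benign class), a small variance floor is added to avoid division by zero, and the 160 bp length cut-off from section 7.3 is applied before counting.

```python
import math


def dqda_score(states, means, variances, prior, var_floor=1e-3):
    """Diagonal quadratic discriminant score of one fragment for one class.

    Sums, over the CpG sites covered by the fragment, the squared deviation
    scaled by the per-site variance plus the log variance, then subtracts
    twice the log prior. A lower score means a better fit to the class.
    """
    score = -2.0 * math.log(prior)
    for site, x in states.items():
        var = max(variances[site], var_floor)  # floor is an added safeguard
        score += (x - means[site]) ** 2 / var + math.log(var)
    return score


def classify_fragment(states, reference, priors):
    """Return the class label (0 benign, 1 malignant) with the lowest score."""
    scores = {
        k: dqda_score(states, reference[k]["mean"], reference[k]["var"], priors[k])
        for k in reference
    }
    return min(scores, key=scores.get)


def malignancy_ratio(fragments, reference, priors, max_len=160):
    """Fraction of short fragments in a region classified as malignant."""
    labels = [
        classify_fragment(states, reference, priors)
        for length, states in fragments
        if length < max_len
    ]
    if not labels:
        return math.nan
    return sum(1 for label in labels if label == 1) / len(labels)


if __name__ == "__main__":
    # Toy hypomethylated marker: benign tissue methylated, malignant unmethylated.
    reference = {
        0: {"mean": [0.9, 0.9, 0.9], "var": [0.01, 0.01, 0.01]},
        1: {"mean": [0.1, 0.1, 0.1], "var": [0.01, 0.01, 0.01]},
    }
    priors = {0: 0.9, 1: 0.1}
    fragments = [
        (150, {0: 0, 1: 0, 2: 0}),  # unmethylated, short
        (155, {0: 1, 1: 1, 2: 1}),  # methylated, short
        (170, {0: 0, 1: 0, 2: 0}),  # unmethylated but too long, excluded
    ]
    print(malignancy_ratio(fragments, reference, priors))
```

Summing only over the sites a fragment actually covers lets fragments of different extents be scored against the same reference.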

7.5 feature selection and model construction

To determine the optimal cfDNA methylation markers for distinguishing malignant from benign samples, the malignancy ratio of the hypomethylated regions was calculated for each sample using a 1-kb sliding window. First, a t-test was used to find the 1000 regions that differed most significantly between the benign and malignant samples as marker candidates. Then, a random forest algorithm was used to perform feature selection with a Recursive Feature Elimination (RFE) strategy: feature importance was evaluated by the Gini index, and the least important 25% of the features were removed at each step so that the feature set shrank gradually. To balance model complexity against model performance, 10 regions were finally selected as markers. Finally, a random forest model was constructed on the training set using the 10 markers (1000 trees, each tree built from 45 benign samples and 45 malignant samples). In the training set, the model score of each sample was obtained from the out-of-bag (OOB) data; in the test set, the random forest model was applied directly to obtain the model score of each sample, which was then used for disease prediction.
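The elimination schedule described above (drop the least important 25% of features per round) can be sketched independently of any particular random forest library. In this illustration `importance_fn` is a hypothetical callable that would, in practice, refit a random forest on the current feature set and return Gini importances; stopping at exactly 10 features is an assumption about the stopping rule, since the text only states that 10 regions were finally chosen.

```python
def rfe_select(features, importance_fn, n_final=10, drop_frac=0.25):
    """Recursive feature elimination with a fixed per-round drop fraction.

    features      : iterable of feature names (e.g. candidate regions)
    importance_fn : callable taking the current feature list and returning
                    a dict {feature: importance}; re-evaluated every round
    n_final       : number of features to keep (assumed stopping rule)
    drop_frac     : fraction of least important features removed per round
    """
    current = list(features)
    while len(current) > n_final:
        importances = importance_fn(current)
        ranked = sorted(current, key=lambda f: importances[f], reverse=True)
        n_drop = max(1, int(len(ranked) * drop_frac))
        n_keep = max(n_final, len(ranked) - n_drop)  # never go below n_final
        current = ranked[:n_keep]
    return current


if __name__ == "__main__":
    # Toy stand-in for Gini importance: the region index itself.
    regions = [f"r{i}" for i in range(1000)]

    def toy_importance(current):
        return {name: int(name[1:]) for name in current}

    print(rfe_select(regions, toy_importance))
```

Re-evaluating importances each round matters because importance values shift as correlated features are removed.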

7.6 construction of Joint diagnostic model

Three features were available for each patient: the cfMeth score, the molybdenum target X-ray score and the ultrasound score. The molybdenum target X-ray and ultrasound scores were assigned as follows: BI-RADS 4a, BI-RADS 4b and BI-RADS 4c were scored 0, 0.5 and 1, respectively.

First, LASSO-penalized logistic regression with 10-fold cross-validation was used, with the mean absolute error as the evaluation metric, to select the optimal lambda parameter. A LASSO model was then constructed on the training set using the optimal lambda parameter, yielding the joint diagnostic model.

8. Statistical analysis

Differences in age, cfDNA concentration and cfDNA methylation markers between subjects were analyzed using Student's t-test. The Pearson chi-square test was used to examine the difference in cfDNA enrichment between hypermethylated regions (hyper-DMRs) and hypomethylated regions (hypo-DMRs). ROC analysis was used to calculate the sensitivity, specificity, accuracy and AUC of the cfDNA methylation markers, alone and combined with molybdenum target X-ray and ultrasound examination. Statistical analysis was performed using R statistical software, version 3.5.1.
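The four performance measures named above can be computed as in the following sketch. It is written in Python for illustration only (the analysis itself was performed in R), the function names are hypothetical, and the AUC is obtained from the rank-based (Mann-Whitney) formulation, with ties counted as one half.

```python
def confusion_metrics(labels, scores, threshold):
    """Sensitivity, specificity and accuracy at a given score threshold.

    labels: 1 for malignant, 0 for benign; a score >= threshold is
    called malignant.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy


def auc(labels, scores):
    """Area under the ROC curve as the probability that a malignant
    sample scores higher than a benign one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]          # toy data, not study results
    scores = [0.9, 0.4, 0.6, 0.1]
    print(confusion_metrics(labels, scores, threshold=0.5))
    print(auc(labels, scores))
```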

9. Results

A comparison of cfDNA from breast cancer patients and from patients with benign breast lesions showed no significant difference in cfDNA concentration or fragment size distribution (FIG. 1).

The cfDNA profiles of all patients are shown in FIG. 2. The peak cfDNA fragment length was about 167 bp (FIG. 2A), and the cfDNA content of different genes was inversely correlated with CpG density (FIG. 2B). The CpG density of hyper-DMRs was significantly higher than that of hypo-DMRs (FIG. 2C), and across all samples the average cfDNA fragment content of hypo-DMRs was significantly higher than that of hyper-DMRs (FIG. 2D). To ensure high-quality cfDNA quantification, hypomethylated regions were therefore selected as candidate DNA methylation markers.

Markers with higher diagnostic efficacy were screened by bioinformatics analysis; the specific information is shown in Table 2.

TABLE 2 methylation markers

^a Importance scores were obtained by Gini index evaluation in random forests

A diagnostic model was constructed by random forest from the 10 selected markers, and a joint diagnostic model was constructed by LASSO. In constructing the joint diagnostic model, the LASSO model was built on the training set using the optimal lambda parameter (lambda = 0.02317884), and the resulting coefficients of the joint diagnostic model were C1 (Coef_cfMeth) = 5.028952, C2 (Coef_ultrasound examination) = 1.628452 and C3 (Coef_molybdenum target X-ray examination) = 1.106189. The diagnostic efficacy of the cfDNA methylation model and the joint diagnostic model is shown in Table 3 and FIG. 3; for the 10 markers, the cfDNA malignancy ratio was higher in breast cancer patients than in patients with benign breast lesions. When applied to the diagnosis of breast cancer, the cfDNA methylation model showed sensitivity, specificity and accuracy significantly higher than those of molybdenum target X-ray examination, ultrasound examination, CEA and CA15-3 (FIG. 4), and the joint diagnostic model showed still higher sensitivity and specificity.
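To show how the three reported coefficients combine with the BI-RADS scoring from section 7.6, the following sketch evaluates the linear predictor of a logistic model. The intercept of the joint model is not reported in the text, so it appears here as a hypothetical parameter defaulting to 0; the resulting probabilities are therefore illustrative only and should not be read as the model's actual output.

```python
import math

# BI-RADS sub-category scores as defined in section 7.6.
BIRADS_SCORE = {"4a": 0.0, "4b": 0.5, "4c": 1.0}

# Coefficients of the joint diagnostic model as reported in the text.
COEF_CFMETH = 5.028952
COEF_ULTRASOUND = 1.628452
COEF_MAMMOGRAPHY = 1.106189  # molybdenum target X-ray examination


def linear_predictor(cfmeth_score, ultrasound_birads, mammography_birads, intercept=0.0):
    """Weighted sum of the three features.

    `intercept` is a placeholder: the source does not report its value.
    """
    return (
        intercept
        + COEF_CFMETH * cfmeth_score
        + COEF_ULTRASOUND * BIRADS_SCORE[ultrasound_birads]
        + COEF_MAMMOGRAPHY * BIRADS_SCORE[mammography_birads]
    )


def joint_probability(cfmeth_score, ultrasound_birads, mammography_birads, intercept=0.0):
    """Logistic transform of the linear predictor."""
    z = linear_predictor(cfmeth_score, ultrasound_birads, mammography_birads, intercept)
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical patient: cfMeth score 0.5, ultrasound 4b, mammography 4c.
    print(linear_predictor(0.5, "4b", "4c"))
    print(joint_probability(0.5, "4b", "4c"))
```

The relative size of the coefficients indicates that the cfMeth score carries the largest weight among the three inputs, given that all three features lie on a 0 to 1 scale.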

Table 3 diagnostic model predicting diagnostic efficacy of breast cancer

The performance of the cfDNA methylation model (cfDNA methylation score) within the imaging BI-RADS categories, and the ability of the cfDNA methylation model and the joint diagnostic model to distinguish benign from malignant tumors within each clinical feature, are shown in Tables 4-5 and FIG. 5. The cfDNA methylation score distinguished benign from malignant tumors well (FIG. 5A), and the cfMeth score was significantly higher in the different clinical subgroups of breast cancer than in the benign group. Compared with BI-RADS classification, the cfDNA methylation score performed stably across the clinical subgroups and was more accurate and convenient for identifying breast cancer.

The relationship between cfDNA methylation score and Ki67, tumor size, ER, PR is shown in fig. 6, from which it can be seen that cfDNA methylation score is positively correlated with Ki67 and tumor size (fig. 6A and 6B), negatively correlated with ER (fig. 6C), and independent of PR (fig. 6D).

TABLE 4 diagnostic accuracy of cfDNA methylation scoring in BI-RADS classifications

TABLE 5 efficiency of breast cancer detection in various clinical features for cfDNA methylation and combined diagnostic models

The above description of the embodiments is only for the understanding of the method of the present invention and its core ideas. It should be noted that it will be apparent to those skilled in the art that several improvements and modifications can be made to the present invention without departing from the principle of the invention, and these improvements and modifications will fall within the scope of the claims of the invention.

Claims

1. Use of the methylation level of a marker in the construction of a diagnostic model for benign and malignant breast tumors, wherein the marker is selected from the group consisting of differential methylation regions of cfDNA, the differential methylation regions being selected from chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143.

2. The application of a reagent for detecting the methylation level of a marker in a sample in preparing a product for diagnosing benign and malignant breast tumor; characterized in that the marker is selected from the group consisting of differential methylation regions of cfDNA selected from chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143.

3. The use according to claim 2, wherein the reagent detects the methylation status of the differentially methylated region using pyrosequencing, bisulfite sequencing, quantitative and/or qualitative methylation-specific polymerase chain reaction, quantitative and/or qualitative bisulfite-specific polymerase chain reaction, digital polymerase chain reaction, targeted sequencing in combination with bisulfite, Southern blotting, restriction landmark genome scanning, single nucleotide primer extension (SNuPE), CpG island microarray, combined sodium bisulfite restriction endonuclease analysis, or mass spectrometry.

4. The use according to claim 2, wherein the agent comprises:

a primer set capable of amplifying the methylation region; or

a probe capable of hybridizing to said methylation region; or

a methylation-sensitive restriction enzyme; or

a sequencing primer.

5. The use according to any one of claims 2-4, wherein the sample is a blood or plasma sample.

6. Use of a kit or chip for the preparation of a product for diagnosing benign-malignant breast tumor, characterized in that the kit or chip comprises reagents for detecting the methylation level of a marker in a sample, said marker being selected from the group consisting of differential methylation regions of cfDNA, said differential methylation regions being selected from the group consisting of chr1:237343683-237344683, chr2:3723342-3724342, chr2:3978342-3979342, chr2:22327459-22328459, chr4:164543184-164544184, chr6:84666439-84667439, chr8:79343444-79344444, chr15:26569301-26570301, chr15:33374552-33375552, chr15:97703143-97704143.

7. The use according to claim 6, wherein the reagent detects the methylation status of the differentially methylated region using pyrosequencing, bisulfite sequencing, quantitative and/or qualitative methylation-specific polymerase chain reaction, quantitative and/or qualitative bisulfite-specific polymerase chain reaction, digital polymerase chain reaction, targeted sequencing in combination with bisulfite, Southern blotting, restriction landmark genome scanning, single nucleotide primer extension (SNuPE), CpG island microarray, combined sodium bisulfite restriction endonuclease analysis, or mass spectrometry.

8. The use according to claim 6, wherein the agent comprises:

a primer set capable of amplifying the methylation region; or

a probe capable of hybridizing to said methylation region; or

a methylation-sensitive restriction enzyme; or

a sequencing primer.

9. The use according to any one of claims 6-8, wherein the sample is a blood or plasma sample.