CN116153420B

CN116153420B - Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model

Info

Publication number: CN116153420B
Application number: CN202310446774.6A
Authority: CN
Inventors: 邵阳; 吴雪; 包华; 刘睿; 吴舒雨; 唐皖湘夫; 唐诗婷; 刘思思
Original assignee: Nanjing Shihe Medical Devices Co ltd; Nanjing Shihe Gene Biotechnology Co ltd
Current assignee: Nanjing Shihe Medical Devices Co ltd; Nanjing Shihe Gene Biotechnology Co ltd
Priority date: 2023-04-24
Filing date: 2023-04-24
Publication date: 2023-08-18
Anticipated expiration: 2043-04-24
Also published as: CN116153420A

Abstract

The invention relates to the application of gene markers in the early screening of malignant breast cancer and benign breast nodules, and to a method for constructing a screening model. Low-depth whole-genome sequencing (WGS) is performed on cfDNA from plasma samples obtained by liquid biopsy. Five feature types, namely window copy number variation (CNV), DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), DNA breakpoint motif (BPM) and DNA end motif (EDM), are used to build a multi-feature, multi-algorithm integrated model by automated machine learning, thereby achieving non-invasive, accurate diagnosis of breast cancer.

Description

Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model

Technical Field

The invention relates to early screening of malignant breast cancer and benign breast nodules, and belongs to the field of molecular biomedicine.

Background

Breast cancer is the most common cancer among women worldwide and the leading cause of cancer death in women. According to the GLOBOCAN 2020 global cancer report, about 2.26 million new breast cancer cases were diagnosed in 2020, accounting for 11.7% of all new cancer cases and making breast cancer the most commonly diagnosed cancer worldwide; it seriously threatens women's physical and mental health and affects their quality of life. Research shows that early screening detects breast cancer sooner and improves survival and quality of life. Currently, the most common screening methods for breast cancer are breast ultrasound, breast magnetic resonance imaging (breast MRI), automated breast ultrasound systems (ABUS) and mammography. However, each existing technique has drawbacks: the quality of a breast ultrasound examination depends to some extent on the operator's experience, patient compliance with MRI is low, and automated breast ultrasound systems are costly. Mammography, the most widely used method, is currently the primary screening method for early breast cancer, but its sensitivity differs between patients with different breast types. For example, screening accuracy is lower in younger women, whereas in women over 50 the fibroglandular tissue of the breast is progressively replaced by adipose tissue with age, so abnormal lesions near adipose tissue are more easily detected and screening accuracy is higher. Screening sensitivity is therefore related to age. In addition, for extremely dense breasts (almost entirely dense tissue), which account for about 10% of the total, mammography suffers from overdiagnosis and low sensitivity.
Studies have shown that the AUC of detection by mammography is 0.79 and that of detection by breast ultrasound is 0.78. The sensitivity of imaging-based detection of breast cancer is thus limited, and relying on imaging as the basis for diagnosing breast tumors increases the risk of unnecessary invasive surgery. There is therefore an urgent need to develop an effective, practical and highly sensitive screening method suitable for a broad population, to provide auxiliary screening for people judged to be at high risk by imaging.

Disclosure of Invention

The invention provides a method in which WGS is performed on cfDNA from plasma samples and, from the high-throughput sequencing results, five feature types that differ between malignant breast cancer and benign nodules are analysed: copy number variation (CNV) in 1 Mb windows, DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), breakpoint motif (BPM) and end motif (EDM). Each feature is modelled separately with a generalized linear model (GLM), a gradient boosting machine (GBM), a random forest (RF), deep learning (DL) and extreme gradient boosting (XGBoost); finally, the multi-feature, multi-algorithm models are integrated by averaging their outputs to obtain a final risk score used for classification, thereby achieving non-invasive, accurate diagnosis of malignant breast cancer.

Use of gene markers in the early screening of malignant breast cancer and benign breast nodules, said gene markers comprising:

a first marker: the copy number in different windows on the chromosomes in WGS data;

a second marker: the proportion of short reads and the proportion of long reads in different windows of the reference genome after the cfDNA fragments are aligned to it; the short reads are 100-150 bp long and the long reads are 151-220 bp long;

a third marker: the number of reads in different length-gradient intervals on the long and short arms of the reference genome after the cfDNA fragments are aligned to it; the length-gradient intervals are obtained by dividing the range 100-220 bp in increments of 4-5 bp; the long and short arms are selected from the following chromosome arms:

chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_p, chr19_q; wherein "chr" followed by digits denotes the chromosome number, q denotes the long arm and p denotes the short arm;

a fourth marker: after the cfDNA fragments are aligned to the reference genome, the proportion of each base sequence formed by the n bp upstream and the n bp downstream of the breakpoint among all such sequences;

a fifth marker: the proportion of each distinct m-base sequence at the 5' end of the cfDNA fragments among all such 5' end sequences.

The first marker is obtained as follows: the reference genome is divided into a plurality of windows, and copy number data are obtained for each window on chromosomes 1-22 in the WGS data; the window size is 0.8-1.2 Mb.

The second marker is obtained as follows: the reference genome is divided into a plurality of windows, and for each window the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window are counted.

The third marker is obtained as follows: the cfDNA fragments are aligned to the reference genome, the long and short arms of each chromosome are taken as regions, and the number of reads in each length-gradient interval within each region is obtained.

The fourth marker is obtained as follows: the cfDNA fragment data are aligned to the reference genome to obtain the position of the 5' end of each read on the reference genome; the sequences of n bp upstream and n bp downstream of that position are taken as the set of base fragments; the proportion of each base fragment among all fragments forms the fourth feature set.

The fifth marker is obtained as follows: the m bases at the 5' end of each cfDNA fragment are taken as the set of base fragments, and the proportion of each base fragment among all fragments is obtained.

n is 4 and m is 8.

The method for constructing the malignant breast cancer screening model, which is used to classify samples as malignant breast cancer or benign breast nodules, comprises the following steps:

step 1, extracting cfDNA from samples of malignant breast cancer patients and of a control group (benign nodule patients) and sequencing it to obtain cfDNA fragmentation information;

step 2, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as the first feature set;

step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the proportion of short reads and the proportion of long reads within each window as the second feature set, wherein the short reads are 100-150 bp long and the long reads are 151-220 bp long;

step 4, aligning the read data to the reference genome, taking the long arm and short arm of each chromosome as regions, and obtaining the number of reads in each length-gradient interval within each region as the third feature set; the length-gradient intervals are obtained by dividing the range 100-220 bp in increments of 4-5 bp; the long and short arms are selected from the chromosome arms listed above for the third marker;

step 5, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; taking the sequences of n bp upstream and n bp downstream of that position as the set of base fragments; and taking the proportion of each base fragment among all fragments as the fourth feature set;

step 6, taking the m bases at the 5' end of each read as the set of base fragments, and obtaining the proportion of each base fragment among all fragments as the fifth feature set;

step 7, taking the first, second, third, fourth and fifth feature sets as initial feature values, inputting them into a classification model as model feature vectors, and training the model with the malignant breast cancer / benign nodule classification as the output value to obtain the malignant breast cancer screening model.

The windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping segments of 0.8-1.2 Mb.

Step 3 includes:

step 3-1, dividing the reference genome into a plurality of windows of 5 Mb each;

step 3-2, for each window, counting the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window.

The number of reads in step 4 is normalized.

In step 5, n is 4.

In step 6, m is 8.

In step 7, the first, second, third, fourth and fifth feature sets are each input into the generalized linear model, the gradient boosting model, the random forest, the deep learning neural network model and the extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model.

In the process of obtaining the plurality of sub-models, the sub-models of each of the first to fifth feature sets are screened according to their classification performance, and the selected sub-models are used in the linear relation model.

The beneficial effects of the invention are as follows: the fragment size ratio, copy number variation and fragment size distribution of WGS cfDNA from 98 malignant breast cancer patients and 93 benign nodule patients were counted and analysed, and a generalized linear model, a gradient boosting model, a random forest model, a deep learning neural network model and an extreme gradient boosting model were trained and integrated through automated machine learning to obtain the final model. The invention is the first to screen for malignant breast cancer based on fragmentation results from high-throughput, low-depth sequencing of plasma cfDNA. Compared with existing detection methods, the model has higher sensitivity, classifies malignant breast tumors and benign nodules more effectively, and reduces the risks of unnecessary surgery and of complications.

Drawings

FIG. 1 is a schematic diagram of a model building process;

FIG. 2 shows the distribution of the most differential variables within each feature type;

FIG. 3 shows the AUC curves of each feature and of the final integrated model in the training set;

FIG. 4 shows the AUC curve of the final integrated model in the validation set;

FIG. 5 shows the distribution of prediction scores from the final model for benign nodules and malignant breast tumors in the validation set.

Description of the embodiments

The calculation method in the invention is detailed as follows:

The invention first requires cfDNA extraction from a blood sample, library construction and sequencing. The extraction and library construction methods are not particularly limited and may be adapted from the prior art; the base information of the cfDNA can be obtained during sequencing using existing sequencing technology. The reference genome used in the invention is version hg19.

The purpose of the model in this patent is to distinguish malignant breast tumors (malignant breast cancer) from benign nodules, i.e. to classify the samples. During training, patients confirmed to have benign nodules by postoperative pathology serve as the control group, and patients confirmed to have malignant breast cancer serve as the positive group.

The data set used in the model construction process of the invention is as follows:

TABLE 1

Extraction and sequencing method of plasma cfDNA sample:

Liquid biopsy was performed on each patient: a 10 ml whole blood sample was collected in a purple-top blood collection tube (EDTA anticoagulant tube), the plasma was separated by centrifugation promptly (within 2 hours), and the plasma was stored frozen at -80 degrees Celsius and transferred to the laboratory for analysis. In the laboratory, cfDNA was extracted from the plasma samples with a QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. After library construction from the extracted cfDNA, WGS was performed at 5x depth. The raw sequencing data were then aligned to the human reference genome (hg19) to obtain the base information of the corresponding reads.

The model building process of the patent is mainly as follows:

step 1, extracting cfDNA from samples of the positive group and the control group and sequencing it to obtain cfDNA fragmentation data;

step 2, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as the first feature;

step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the position of each cfDNA fragment on the reference genome and its length; the proportions of short and long cfDNA reads among all fragments in each window are taken as the second feature;

step 4, at the level of each chromosome arm, calculating the fragment coverage in 5 bp length bins over the range 100 bp to 220 bp as the third feature;

step 5, taking the frequency of each base combination at the breakpoints of the DNA fragments as the fourth feature;

step 6, taking the frequency of each base combination at the ends of the DNA fragments as the fifth feature;

step 7, inputting the model feature vectors of the positive-group and control-group samples into the first-layer models, selecting the best model for each feature, and integrating them by averaging to obtain the final model output.
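
The averaging integration of step 7 can be sketched as follows. This is a minimal illustration, assuming each selected sub-model outputs a per-sample probability of malignancy; the function names, the example probabilities and the 0.5 cutoff are hypothetical and are not taken from the patent.

```python
import numpy as np

def ensemble_score(submodel_probs):
    """Average per-sample malignancy probabilities across sub-models.

    submodel_probs: array-like of shape (n_models, n_samples).
    """
    probs = np.asarray(submodel_probs, dtype=float)
    return probs.mean(axis=0)

def classify(scores, cutoff=0.5):
    """Label a sample malignant (1) when its averaged score reaches the cutoff.

    The cutoff value is an illustrative assumption.
    """
    return (np.asarray(scores) >= cutoff).astype(int)

# Hypothetical outputs of five sub-models (one per feature type) for three samples.
probs = [
    [0.9, 0.2, 0.6],  # CNV sub-model
    [0.8, 0.1, 0.4],  # FSR sub-model
    [0.7, 0.3, 0.5],  # FSD sub-model
    [0.6, 0.2, 0.7],  # BPM sub-model
    [1.0, 0.2, 0.2],  # EDM sub-model
]
scores = ensemble_score(probs)
labels = classify(scores)
```

Averaging gives every selected sub-model equal weight, so no additional parameters have to be fitted at the integration stage.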

The patent uses five feature types, each described in detail below:

1. 1 Mb window copy number variation (1 Mb-Bin Copy Number Variation, CNV)

Copy number changes are highly correlated with many cancers. Although cancers can already be distinguished by detecting copy number changes in certain cancer-associated genes or specific genomic intervals, other rare or unknown genes or intervals can also provide information on potential copy number changes.

The copy number data are collected as follows. First, WGS data from 30 healthy people are collected, and chromosomes 1-22 of the reference genome are divided into non-overlapping 1 Mb windows. For each sample, the read depth in each window is calculated with bedtools coverage and corrected for the GC content and average mappability (UCSC bigWig file) of the window. The median depth of the 30 healthy people in each window is taken as representative, giving a population baseline of read depths for 2475 windows. For each sample to be tested, read depths for the same 2475 windows are obtained, and a hidden Markov model (HMM) is used together with the population baseline depth of each window to derive the log copy number change of each window, i.e. log2(corrected and normalized depth of the test sample / corrected and normalized depth of the population baseline), giving the copy number variation information of each test sample.
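
The per-window log2 ratio described above can be sketched as follows. This is a simplified illustration that assumes the depths have already been corrected for GC content and mappability; it omits the HMM step, and the per-sample median normalization and all names are assumptions of the sketch rather than details from the patent.

```python
import numpy as np

def log2_copy_ratio(sample_depth, healthy_depths):
    """Per-window log2(sample depth / population baseline depth).

    sample_depth:   shape (n_windows,), corrected depths of one test sample.
    healthy_depths: shape (n_healthy, n_windows), corrected depths of healthy controls.
    Each sample is divided by its own median depth so that differences in
    sequencing yield cancel out (an assumption of this sketch).
    """
    sample = np.asarray(sample_depth, dtype=float)
    healthy = np.asarray(healthy_depths, dtype=float)
    sample_norm = sample / np.median(sample)
    healthy_norm = healthy / np.median(healthy, axis=1, keepdims=True)
    baseline = np.median(healthy_norm, axis=0)  # population baseline per window
    return np.log2(sample_norm / baseline)

# Toy data: three healthy controls and four windows; the test sample has
# twice the expected depth in the third window.
healthy = np.array([[100, 100, 100, 100],
                    [200, 200, 200, 200],
                    [50, 50, 50, 50]])
ratios = log2_copy_ratio([10, 10, 20, 10], healthy)
```

With real data the arrays would hold 2475 windows per sample, and the resulting ratios would then be passed to the segmentation step.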

2. cfDNA fragment size ratio (Fragment Size Ratio, FSR)

The DNA fragment size ratio reflects the proportions of cfDNA reads of different lengths. A prediction model that distinguishes malignant breast cancer from benign breast nodules is built by machine learning on the DNA fragment size ratio. Comparing the cfDNA read lengths of malignant breast cancer and benign breast nodules showed that the distribution of fragment counts in the ranges 100-150 bp, 151-220 bp and 100-220 bp differs across the chromosomes, which can be used as a distinguishing feature.

cfDNA read length data were obtained as follows. In the aligned BAM file, the mapping quality, length and alignment position of each read are recorded; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb, and the total number of reads (100-220 bp), the number of short reads (100-150 bp) and the number of long reads (151-220 bp) in each window are counted. Each read count is then standardized across all windows: standardized value = (original value - mean) / standard deviation. This yields 1082 read-count features (541 x 2 = 1082) covering the two length classes (short and long). In the feature data set, the proportion of reads in each length range in each window is calculated from the read counts as: number of cfDNA reads in the length range / number of all cfDNA reads in the window.
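
The short and long read proportions per window and the standardization formula can be sketched as follows. The sketch assumes each fragment has already been assigned a window index and a length (for example from an aligned BAM file); the function names and toy data are hypothetical.

```python
import numpy as np

SHORT = (100, 150)  # bp, inclusive bounds of short reads
LONG = (151, 220)   # bp, inclusive bounds of long reads

def fragment_size_ratios(window_idx, frag_len, n_windows):
    """Return (short_ratio, long_ratio), one value per window.

    Ratios are relative to all 100-220 bp fragments in the window;
    windows without such fragments get a ratio of 0.
    """
    window_idx = np.asarray(window_idx, dtype=int)
    frag_len = np.asarray(frag_len)
    in_range = (frag_len >= SHORT[0]) & (frag_len <= LONG[1])
    is_short = in_range & (frag_len <= SHORT[1])
    is_long = in_range & (frag_len >= LONG[0])
    total = np.bincount(window_idx[in_range], minlength=n_windows).astype(float)
    short = np.bincount(window_idx[is_short], minlength=n_windows)
    long_ = np.bincount(window_idx[is_long], minlength=n_windows)
    with np.errstate(divide="ignore", invalid="ignore"):
        short_ratio = np.where(total > 0, short / total, 0.0)
        long_ratio = np.where(total > 0, long_ / total, 0.0)
    return short_ratio, long_ratio

def zscore(values):
    """Standardized value = (original value - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Toy data: seven fragments spread over three windows.
short_ratio, long_ratio = fragment_size_ratios(
    window_idx=[0, 0, 0, 0, 1, 1, 2],
    frag_len=[120, 140, 160, 300, 150, 151, 90],
    n_windows=3,
)
standardized = zscore([1.0, 2.0, 3.0])
```

With 541 windows, concatenating the short and long vectors gives the 1082 features mentioned above.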

3. cfDNA fragment size distribution (Fragment Size Distribution, FSD)

Building on the DNA fragment size ratio, 39 regions corresponding to the long and short arms of the chromosomes of the human reference genome are used as windows to obtain higher-resolution read statistics. The windows are as follows:

TABLE 2

chr1_p	chr4_q	chr8_p	chr11_q	chr16_q	chr20_p
chr1_q	chr5_p	chr8_q	chr12_p	chr17_p	chr20_q
chr2_p	chr5_q	chr9_p	chr12_q	chr17_q	chr21_q
chr2_q	chr6_p	chr9_q	chr13_q	chr18_p	chr22_q
chr3_p	chr6_q	chr10_p	chr14_q	chr18_q
chr3_q	chr7_p	chr10_q	chr15_q	chr19_p
chr4_p	chr7_q	chr11_p	chr16_p	chr19_q

Fragments of 100-220 bp are divided in 5 bp increments into 24 length gradients (for example, 100-104 bp, 105-109 bp, ... on the q arm of chr1). The number of fragments in each length gradient in each chromosome-arm window is counted and standardized, giving 936 features in total (936 = 39 arms x 24 standardized length-gradient counts) from the high-resolution DNA fragment size distribution.
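
The binning into 24 length gradients per chromosome arm can be sketched as follows. The sketch assumes each fragment already carries an arm index (0 to 38) and a length; it counts only, leaving out the standardization, and under this binning (100-104, ..., 215-219) a fragment of exactly 220 bp falls outside the last bin, which is an assumption of the sketch. All names are hypothetical.

```python
import numpy as np

BIN_START, BIN_STOP, BIN_STEP = 100, 220, 5    # bins 100-104, ..., 215-219
N_BINS = (BIN_STOP - BIN_START) // BIN_STEP    # 24 length gradients

def fsd_counts(arm_idx, frag_len, n_arms):
    """Count fragments per (chromosome arm, 5 bp length bin)."""
    arm_idx = np.asarray(arm_idx, dtype=int)
    frag_len = np.asarray(frag_len)
    keep = (frag_len >= BIN_START) & (frag_len < BIN_STOP)
    bin_idx = ((frag_len[keep] - BIN_START) // BIN_STEP).astype(int)
    counts = np.zeros((n_arms, N_BINS), dtype=int)
    # np.add.at accumulates correctly when the same cell is hit repeatedly.
    np.add.at(counts, (arm_idx[keep], bin_idx), 1)
    return counts

# Toy data: four fragments; the 220 bp fragment is outside the 24 bins.
counts = fsd_counts(arm_idx=[0, 0, 1, 38], frag_len=[100, 104, 219, 220], n_arms=39)
features = counts.ravel()  # 39 x 24 = 936 values per sample
```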

4. cfDNA BreakPoint sequence (BreakPoint Motif, BPM)

Human genomic DNA is a double helix held together by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the cellular environment changes, the hydrogen bonds between complementary bases are disrupted, and breaks occur; after the DNA enters the blood circulation, it undergoes non-random fragmentation. This process may be related to tissue of origin, disease state, nucleosome accessibility and endonuclease activity. Because the base sequences at the breaks differ, the proportions of the different breakpoint sequences also differ. The collection method is as follows: in the aligned BAM file, the base information and alignment position of each read are recorded; the reference genome coordinate of the 5' end of each read is taken as the breakpoint, and the 4 bp of human reference sequence on each side of it are retrieved. The number of reads carrying each 8 bp sequence at the breakpoint (4^8 = 65536 possible sequences in total) is counted, and the read proportion of each of the 65536 breakpoint sequences is calculated, e.g. proportion of AAAAAAAA = number of reads with AAAAAAAA / total number of reads over all breakpoint sequences.
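
The breakpoint motif extraction can be sketched as follows. The sketch uses a short toy string in place of the reference genome, treats the 0-based coordinate of a read's first base as the breakpoint, and handles forward-strand reads only; these simplifications and all names are assumptions for illustration.

```python
from collections import Counter

def breakpoint_motif(reference, five_prime_pos, flank=4):
    """Return the 2*flank-mer spanning the breakpoint, or None near sequence ends.

    The breakpoint lies between reference positions five_prime_pos - 1 and
    five_prime_pos, so the motif is `flank` bases upstream plus `flank`
    bases downstream.
    """
    start, stop = five_prime_pos - flank, five_prime_pos + flank
    if start < 0 or stop > len(reference):
        return None
    return reference[start:stop]

def motif_proportions(motifs):
    """Proportion of each motif among all valid (A/C/G/T only) motifs."""
    counts = Counter(m for m in motifs if m is not None and set(m) <= set("ACGT"))
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}

# Toy reference and four read start positions; the last one is too close
# to the start of the sequence and is skipped.
reference = "ACGTACGTACGT"
motifs = [breakpoint_motif(reference, pos) for pos in (4, 8, 5, 1)]
proportions = motif_proportions(motifs)
```

Motifs that never occur simply have no entry; a full 65536-element feature vector would assign them a proportion of 0.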

5. cfDNA end sequence (End Motif, EDM)

After alignment, the 8 bp sequence at the 5' end of each read is obtained, and the number of reads carrying each end sequence (4^8 = 65536 possible sequences in total) is counted; the read proportion of each of the 65536 end sequences is then calculated, e.g. proportion of AAAAAAAA = number of reads with AAAAAAAA / total number of reads over all end sequences.
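
The end motif proportions can be sketched as a fixed-length vector over all 4^8 = 65536 possible 8-mers. The sketch assumes read sequences are available as plain strings oriented 5' to 3'; reads shorter than 8 bp or containing bases other than A, C, G and T are ignored, which is an assumption of the sketch, and the names are hypothetical.

```python
from collections import Counter
from itertools import product

K = 8
ALL_KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # 4**8 = 65536 motifs

def end_motif_vector(read_sequences):
    """Proportion of reads starting with each 8-mer, in ALL_KMERS order."""
    counts = Counter(seq[:K] for seq in read_sequences if len(seq) >= K)
    # Only motifs made of A/C/G/T contribute to the denominator.
    total = sum(counts[m] for m in ALL_KMERS)
    if total == 0:
        return [0.0] * len(ALL_KMERS)
    return [counts[m] / total for m in ALL_KMERS]

# Toy reads: two start with AAAAAAAA, one with ACGTACGT; the read with N
# bases and the 3 bp read are ignored.
reads = ["AAAAAAAACG", "AAAAAAAATT", "ACGTACGTAA", "NNNNNNNNAA", "ACG"]
vector = end_motif_vector(reads)
```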

Through the above data acquisition, initial data vectors for the five feature types are obtained. The algorithms used for modelling are described next:

1. Generalized linear model (Generalized Linear Model, GLM)

The generalized linear model is a common machine learning algorithm. It generalizes the linear regression model and overcomes its inability to handle discrete dependent variables. It introduces a link function that connects the linear predictor to the value of the dependent variable y, acting as a bridge between the linear and the non-linear.
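
The role of the link function can be illustrated with a binary outcome. The sketch below assumes a logit link, which is the usual choice for two-class problems but is not specified in the patent; the coefficients are made-up values and the function names are hypothetical.

```python
import numpy as np

def logit(p):
    """Link function: maps a probability in (0, 1) to the real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def inverse_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    eta = np.asarray(eta, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

def glm_predict(x, coef, intercept):
    """Probability of the positive class under a binomial GLM with logit link."""
    eta = np.asarray(x, dtype=float) @ np.asarray(coef, dtype=float) + intercept
    return inverse_logit(eta)

# Two samples with two features each; coefficients are illustrative only.
probs = glm_predict([[0.0, 0.0], [2.0, 0.0]], coef=[1.0, -0.5], intercept=0.0)
```

The linear predictor can take any real value, while the inverse link keeps the predicted probability between 0 and 1.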

2. Gradient boosting model (Gradient Boosting Machine, GBM)

In each iteration of the gradient boosting algorithm, the negative gradient of the loss of the current model is first computed on all samples; a new weak classifier is then trained to fit these values as targets, the weight of the weak classifier is computed, and finally the model is updated.
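
The iteration described above can be sketched on a single feature with decision stumps as weak learners. This is a minimal illustration under stated assumptions: log loss as the loss function, a fixed learning rate standing in for the per-learner weight, and hypothetical function names; it is not the implementation used in the patent.

```python
import numpy as np

def fit_stump(x, residual):
    """Least-squares regression stump (one threshold) on a single feature."""
    best = None
    for t in np.unique(x)[:-1]:  # excluding the maximum keeps both sides non-empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.3):
    """Boost stumps on the negative gradient of the log loss."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    score = np.zeros_like(y)  # current model output in log-odds
    learners = []
    for _ in range(n_rounds):
        prob = 1.0 / (1.0 + np.exp(-score))
        neg_grad = y - prob                 # negative gradient w.r.t. the score
        stump = fit_stump(x, neg_grad)      # weak learner fitted to that target
        score += learning_rate * stump(x)   # model update
        learners.append(stump)

    def predict_proba(z):
        z = np.asarray(z, dtype=float)
        total = sum(learning_rate * h(z) for h in learners)
        return 1.0 / (1.0 + np.exp(-total))

    return predict_proba

# Toy data: small values are class 0, large values are class 1.
model = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
p = model([2.0, 5.0])
```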

3. Random forest (Random Forest)

Random forests are a powerful classification and regression tool for high-dimensional data with multicollinearity. Given a data set, a random forest repeatedly draws random subsets of the data to grow a set of decision trees for classification or regression; at each node a splitting attribute is chosen, and splitting continues until the tree can no longer be split. Finally, the results of all trees are combined to give the final prediction.

4. Deep learning neural network model (Deep Learning Neural Network)

A neural network consists of inputs, weights, biases (or thresholds) and outputs. A node is activated when its output exceeds a specified threshold, and its data are then passed to the next layer of the network. Each node of the input layer is connected to each node of the hidden layer, where a weighted sum followed by an activation function is applied. The values computed by the hidden layer are passed to the output layer, which is computed in the same way. The method offers high classification accuracy, strong parallel distributed processing capability, and strong distributed storage and learning capability.
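
The weighted-sum-and-activation computation can be sketched for a network with one hidden layer. The ReLU and sigmoid activations, the layer sizes and the weights below are illustrative assumptions; the patent does not specify them at this point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: input -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)  # weighted sum, then activation
    return sigmoid(hidden @ w_out + b_out)             # same scheme at the output layer

# One sample with two input features, two hidden nodes and one output node.
x = np.array([[1.0, 2.0]])
w_hidden = np.array([[1.0, -1.0],
                     [0.5, 0.5]])
b_hidden = np.array([0.0, 0.0])
w_out = np.array([[1.0],
                  [1.0]])
b_out = np.array([-2.0])
output = forward(x, w_hidden, b_hidden, w_out, b_out)
```

Here the hidden layer yields (2, 0), the output node receives 2 + 0 - 2 = 0, and the sigmoid maps that to a probability of 0.5.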

5. Extreme gradient boosting (Extreme Gradient Boosting, XGBoost)

XGBoost is an optimized additive-model algorithm based on the ensemble idea of the gradient boosting decision tree (GBDT). It optimizes the loss function using a second-order Taylor expansion, which improves accuracy; uses a regularization term to simplify the model and avoid overfitting; and adopts a block storage structure that enables parallel computation.

In addition, the patent uses random hyperparameter search (Random Grid Search) to optimize the models. Random search is a common method of hyperparameter optimization in machine learning: parameter values are drawn at random from a specified range for each model parameter, and the best combination among the sampled ones is selected. Rather than trying every possible combination, the method evaluates a fixed number of random combinations, each taking one random value per hyperparameter. Compared with manual tuning and grid search, random search can achieve better results with fewer trials and is more efficient, particularly when there are many parameters.
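
Random search over a discrete grid can be sketched as follows. The grid reuses the GBM search space listed in Table 3; the scoring function is a placeholder standing in for training a model and measuring its AUC, and the function names, trial count and seed are hypothetical.

```python
import random

# GBM search space as listed in Table 3.
GBM_GRID = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "min_rows": [1, 5, 10, 15, 30, 100],
    "nbins": [10, 20, 40, 60],
}

def random_search(grid, evaluate, n_trials=20, seed=0):
    """Sample n_trials random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # One random value per hyperparameter forms one candidate combination.
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_auc(params):
    """Placeholder objective; a real run would train a model and return its AUC."""
    return 1.0 - 0.05 * abs(params["max_depth"] - 5) - 0.001 * abs(params["nbins"] - 20)

best_params, best_score = random_search(GBM_GRID, toy_auc)
```

The full GBM grid holds 8 x 6 x 4 = 192 combinations, so 20 trials cover only about a tenth of it.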

During model optimization and tuning, the hyperparameters searched for the five algorithms used in the patent are shown in the following table:

TABLE 3

Algorithm	Model hyperparameters
Generalized linear model (GLM)	alpha {0.0,0.2,0.4,0.6,0.8,1.0}
Extreme gradient boosting model (XGBoost)	max_depth {3,4,5,6,7,8,9,10,15,20}; min_rows {0.01,0.1,1.0,3.0,5.0,10.0,15.0,20.0}; min_child_weight {3,5,10,15,20}
Random forest (Random Forest)	max_depth {3,4,5,6,7,8,9,10,11,12,13,14,15,16,17}; min_rows {1,5,10,15,30,100}; ntrees: 10000
Deep learning neural network model (Deep Learning)	epsilon {1e-6,1e-7,1e-8,1e-9}; hidden {20},{50},{100}; rho {0.9,0.95,0.99}
Gradient boosting model (GBM)	max_depth {3,4,5,6,7,8,9,10}; min_rows {1,5,10,15,30,100}; nbins {10,20,40,60}

After the five types of initial data were obtained for the 98 malignant breast cancers and 93 benign breast nodules, the copy number variation (CNV) results were used as input, and the malignant breast tumor samples and benign breast nodule samples were classified by each of the five classification algorithms. During screening, the parameters and structure of each of the five algorithms were varied by random hyperparameter search to produce sub-models trained on the data, and the three best sub-models for the feature were then selected, using the AUC on the training set as the measure of classification performance. Similarly, the cfDNA fragment size ratio (FSR), cfDNA fragment size distribution (FSD), breakpoint motif (BPM) and end motif (EDM) of malignant breast tumors and benign breast nodules were each used as input, and three best sub-models were selected for each feature (by the same optimization procedure), giving 3 x 5 = 15 models in total. Each calculation also yields the contribution of each feature variable to the classification result. The 3 best models selected for each feature (15 models in total) are shown in the following table:

For the best model selected for each feature, the top-ranked feature variables by contribution value and their contribution values are as follows:

Copy number variation (CNV), extreme gradient boosting (XGBoost) model:

TABLE 4

Rank	Variable	Contribution value	Rank	Variable	Contribution value
1	Cnv.22.46000001.47000000	1	21	Cnv.3.50000001.51000000	0.192185063
2	Cnv.4.176000001.177000000	0.707718729	22	Cnv.4.135000001.136000000	0.187070758
3	Cnv.4.103000001.104000000	0.702344457	23	Cnv.12.97000001.98000000	0.159003193
4	Cnv.6.132000001.133000000	0.603610479	24	Cnv.7.82000001.83000000	0.158473368
5	Cnv.22.48000001.49000000	0.584821318	25	Cnv.10.126000001.127000000	0.153133441
6	Cnv.3.101000001.102000000	0.51322448	26	Cnv.8.33000001.34000000	0.152661605
7	Cnv.3.153000001.154000000	0.497560161	27	Cnv.22.29000001.30000000	0.131411155
8	Cnv.13.75000001.76000000	0.480732668	28	Cnv.5.122000001.123000000	0.128729099
9	Cnv.12.76000001.77000000	0.353319757	29	Cnv.6.3000001.4000000	0.128214895
10	Cnv.9.134000001.135000000	0.344604821	30	Cnv.1.241000001.242000000	0.1216052
11	Cnv.2.129000001.130000000	0.329638899	31	Cnv.12.82000001.83000000	0.118804964
12	Cnv.18.34000001.35000000	0.267532225	32	Cnv.13.105000001.106000000	0.114712761
13	Cnv.8.110000001.111000000	0.26669307	33	Cnv.7.5000001.6000000	0.114293264
14	Cnv.3.80000001.81000000	0.256606013	34	Cnv.8.10000001.11000000	0.105616978
15	Cnv.16.56000001.57000000	0.253101489	35	Cnv.3.189000001.190000000	0.105581721
16	Cnv.3.21000001.22000000	0.232077932	36	Cnv.11.97000001.98000000	0.102099504
17	Cnv.16.50000001.51000000	0.22288311	37	Cnv.9.107000001.108000000	0.09950168
18	Cnv.3.41000001.42000000	0.211229986	38	Cnv.19.34000001.35000000	0.0989202
19	Cnv.6.81000001.82000000	0.202457945	39	Cnv.3.52000001.53000000	0.089593942
20	Cnv.15.62000001.63000000	0.201399251	40	Cnv.18.35000001.36000000	0.085212451

cfDNA fragment size ratio (FSR), extreme gradient boosting (XGBoost) model:

TABLE 5

Rank	Variable	Contribution value	Rank	Variable	Contribution value
1	Frag.longA408	1	26	Frag.longA60	0.09980968
2	Frag.shortA64	0.85819829	27	Frag.shortA251	0.09972169
3	Frag.longA46	0.64443052	28	Frag.longA535	0.0921953
4	Frag.longA102	0.63926766	29	Frag.longA523	0.0898759
5	Frag.longA223	0.42440395	30	Frag.longA237	0.08636216
6	Frag.longA316	0.29105056	31	Frag.longA44	0.08317273
7	Frag.longA30	0.25139644	32	Frag.shortA227	0.08238297
8	Frag.longA101	0.24885936	33	Frag.longA492	0.07812515
9	Frag.shortA346	0.24481562	34	Frag.longA71	0.07647141
10	Frag.longA248	0.23781	35	Frag.longA257	0.0744764
11	Frag.longA32	0.19572478	36	Frag.longA389	0.07397167
12	Frag.shortA511	0.19031787	37	Frag.shortA360	0.07290724
13	Frag.longA163	0.16107737	38	Frag.longA430	0.06958323
14	Frag.shortA310	0.15044681	39	Frag.shortA87	0.06900854
15	Frag.shortA146	0.13785492	40	Frag.shortA312	0.06695638
16	Frag.shortA491	0.1351144	41	Frag.longA108	0.06349707
17	Frag.longA185	0.1294817	42	Frag.shortA389	0.06096496
18	Frag.longA130	0.12876509	43	Frag.shortA35	0.05931402
19	Frag.shortA408	0.12708398	44	Frag.shortA61	0.05915703
20	Frag.shortA332	0.12464014	45	Frag.shortA393	0.05727665
21	Frag.shortA253	0.12012323	46	Frag.shortA353	0.05415344
22	Frag.longA245	0.11298206	47	Frag.longA195	0.0530392
23	Frag.longA219	0.10198909	48	Frag.shortA63	0.05250434
24	Frag.shortA196	0.10106005	49	Frag.longA517	0.0520624
25	Frag.longA208	0.0999192	50	Frag.shortA361	0.05163735

cfDNA Fragment Size Distribution (FSD) deep learning neural network regression model (DeepLearning, NN):

TABLE 6

No.	Variable	Contribution value	No.	Variable	Contribution value
1	FragArm.chr19.19p.frag.200.204	1	26	FragArm.chr22.22q.frag.215.219	0.62650544
2	FragArm.chr19.19q.frag.205.209	0.93421996	27	FragArm.chr7.7q.frag.170.174	0.62340617
3	FragArm.chr17.17q.frag.170.174	0.84517437	28	FragArm.chr3.3p.frag.170.174	0.621714
4	FragArm.chr11.11q.frag.170.174	0.74109721	29	FragArm.chr21.21q.frag.215.219	0.61096567
5	FragArm.chr8.8p.frag.215.219	0.72974157	30	FragArm.chr20.20p.frag.210.214	0.60664117
6	FragArm.chr18.18q.frag.170.174	0.72518045	31	FragArm.chr7.7p.frag.170.174	0.6036374
7	FragArm.chr4.4q.frag.170.174	0.71589434	32	FragArm.chr9.9q.frag.215.219	0.60263228
8	FragArm.chr22.22q.frag.170.174	0.71454889	33	FragArm.chr9.9q.frag.170.174	0.59463716
9	FragArm.chr8.8q.frag.170.174	0.71383041	34	FragArm.chr19.19p.frag.205.209	0.58072054
10	FragArm.chr15.15q.frag.170.174	0.70367897	35	FragArm.chr17.17p.frag.200.204	0.57559198
11	FragArm.chr6.6p.frag.170.174	0.70319629	36	FragArm.chr16.16q.frag.215.219	0.57427329
12	FragArm.chr18.18p.frag.175.179	0.69913715	37	FragArm.chr2.2p.frag.170.174	0.57368487
13	FragArm.chr20.20p.frag.175.179	0.68919247	38	FragArm.chr13.13q.frag.170.174	0.57236755
14	FragArm.chr19.19q.frag.170.174	0.68403781	39	FragArm.chr20.20p.frag.205.209	0.57026112
15	FragArm.chr19.19q.frag.210.214	0.67318714	40	FragArm.chr1.1q.frag.170.174	0.56910765
16	FragArm.chr9.9p.frag.215.219	0.67183381	41	FragArm.chr10.10p.frag.170.174	0.56614232
17	FragArm.chr12.12p.frag.170.174	0.64841783	42	FragArm.chr14.14q.frag.170.174	0.56131285
18	FragArm.chr1.1p.frag.170.174	0.63953185	43	FragArm.chr8.8p.frag.175.179	0.5551706
19	FragArm.chr20.20q.frag.215.219	0.6361028	44	FragArm.chr5.5p.frag.175.179	0.55327171
20	FragArm.chr12.12p.frag.215.219	0.63554609	45	FragArm.chr19.19q.frag.175.179	0.55095059
21	FragArm.chr6.6q.frag.170.174	0.63494736	46	FragArm.chr12.12q.frag.170.174	0.55088931
22	FragArm.chr17.17p.frag.170.174	0.63375968	47	FragArm.chr10.10q.frag.170.174	0.54725403
23	FragArm.chr2.2q.frag.170.174	0.63122767	48	FragArm.chr10.10p.frag.215.219	0.54440355
24	FragArm.chr3.3q.frag.170.174	0.62818843	49	FragArm.chr18.18p.frag.195.199	0.53650242
25	FragArm.chr5.5p.frag.215.219	0.62724286	50	FragArm.chr19.19p.frag.175.179	0.53099901
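
The feature names in Table 6 follow the pattern FragArm.chr<chromosome>.<arm>.frag.<start>.<end>, where start and end bound a 5 bp fragment-length bin on one chromosome arm. The following is a minimal sketch of how such per-arm, per-bin read counts might be tallied. It assumes the fragments have already been aligned and assigned to an arm, and that the bins cover 100-219 bp as half-open 5 bp steps; the function names and the (arm, length) input format are hypothetical and are not taken from the invention. Normalization of the counts is omitted.

```python
from collections import Counter


def fsd_bin_label(arm, length, lo=100, hi=220, step=5):
    """Return a Table 6-style feature name for one fragment.

    arm is a string such as "19p"; length is the fragment length in bp.
    Returns None when the length falls outside [lo, hi) (assumed range).
    """
    if not lo <= length < hi:
        return None
    start = lo + ((length - lo) // step) * step  # lower edge of the 5 bp bin
    chrom = arm[:-1]                             # "19p" -> "19"
    return f"FragArm.chr{chrom}.{arm}.frag.{start}.{start + step - 1}"


def fsd_counts(fragments):
    """Count fragments per (arm, length-bin) feature.

    fragments is an iterable of (arm, length) pairs (hypothetical input format).
    """
    counts = Counter()
    for arm, length in fragments:
        label = fsd_bin_label(arm, length)
        if label is not None:
            counts[label] += 1
    return counts
```

For example, a 202 bp fragment on arm 19p would be counted under FragArm.chr19.19p.frag.200.204, the top-ranked feature in Table 6.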

Breakpoint sequence deep learning neural network regression model (DeepLearning, NN):

TABLE 7

No.	Variable	Contribution value	No.	Variable	Contribution value
1	BPM_ACGAAGTT	1	26	BPM_AGAAGTAC	0.66628772
2	BPM_CAATTATA	0.94607401	27	BPM_TAACGCGC	0.66143578
3	BPM_AGCGGTTC	0.89684391	28	BPM_GTGCGTAA	0.6593492
4	BPM_CCGGATCT	0.8741132	29	BPM_TCGTATCT	0.65734679
5	BPM_GACTCGCG	0.85113877	30	BPM_CCGTAACA	0.65716744
6	BPM_TCCATGCA	0.81111783	31	BPM_AAAAGGTC	0.65640622
7	BPM_GTGCAAAT	0.8035053	32	BPM_GCCGCGGT	0.6516785
8	BPM_TCGACGGA	0.79976958	33	BPM_ATAAGGGC	0.64762968
9	BPM_CGGCACGG	0.78045821	34	BPM_TTCGTTTA	0.64635307
10	BPM_ATCCGTAA	0.76044983	35	BPM_GCGGCCGG	0.64323455
11	BPM_GGCGTGCC	0.75822753	36	BPM_TCCGTTCT	0.64271921
12	BPM_CCGGAACG	0.73871589	37	BPM_ATGCGAAG	0.64210856
13	BPM_CAAAACTA	0.72939718	38	BPM_GCTGAGCA	0.6320973
14	BPM_TATAGTTA	0.71431983	39	BPM_TGATTATA	0.62828749
15	BPM_AGCACAAT	0.71318734	40	BPM_TACTTGCC	0.62744111
16	BPM_GTTCCGGG	0.71131843	41	BPM_AAACCCCC	0.62479603
17	BPM_GGCTTGAA	0.70888007	42	BPM_ATCCCCGT	0.61880386
18	BPM_AACGTTCG	0.7037878	43	BPM_CATAGGAA	0.61735392
19	BPM_TCGTGCGG	0.70090812	44	BPM_GTGCTCGT	0.61727566
20	BPM_AACGACCC	0.69668245	45	BPM_TCCGAAAA	0.6157372
21	BPM_CCGCGGAT	0.69300968	46	BPM_TCGGCGAT	0.6151548
22	BPM_TGTATCCT	0.67933434	47	BPM_CTCGTCCC	0.6127463
23	BPM_ATCTTTCC	0.67813677	48	BPM_TTCGGTTT	0.61045146
24	BPM_TGCGAGTC	0.67347133	49	BPM_TAAAGTTA	0.60864198
25	BPM_ACGTCTTG	0.6698823	50	BPM_TATCGCCC	0.60788792
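
Each BPM feature in Table 7 is an 8-base sequence, consistent with taking n = 4 bases upstream and n = 4 bases downstream of the position where the 5' end of a read maps on the reference genome. The following is a minimal sketch of that extraction and of converting the motifs into proportions. The coordinate convention (a 0-based position, with the downstream half starting at the mapped base), the omission of strand handling, and all function names are assumptions made for illustration only.

```python
from collections import Counter


def breakpoint_motif(reference, pos, n=4):
    """Return the 2n-base reference sequence around a read's 5' end position.

    reference is the reference sequence of one chromosome; pos is a 0-based
    index (assumed convention). Returns None if the window runs off the
    sequence or contains a base other than A, C, G, T.
    """
    start, end = pos - n, pos + n
    if start < 0 or end > len(reference):
        return None
    motif = reference[start:end].upper()
    return motif if set(motif) <= set("ACGT") else None


def motif_proportions(motifs, prefix):
    """Convert a list of motifs into proportions keyed as '<prefix>_<motif>'."""
    counts = Counter(m for m in motifs if m is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f"{prefix}_{m}": c / total for m, c in counts.items()}
```

Calling motif_proportions(motifs, "BPM") on the motifs collected from all reads of one sample yields one proportion per observed 8-mer, named in the same style as the variables in Table 7.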

End sequence deep learning neural network regression model (DeepLearning, NN):

TABLE 8

No.	Variable	Contribution value	No.	Variable	Contribution value
1	EDM_GAGTCGAT	1	26	EDM_CGTACGCG	0.67282635
2	EDM_CAGCCGCT	0.94303036	27	EDM_CTAACGTA	0.67080379
3	EDM_AGCGTTAC	0.89571661	28	EDM_GGGATATG	0.66796774
4	EDM_GAACGTAT	0.82257056	29	EDM_TGTACCTT	0.66630644
5	EDM_CGTGCTAG	0.78153014	30	EDM_GCGATAGA	0.66531879
6	EDM_GGTGATAA	0.73706114	31	EDM_GCATTCGG	0.6649462
7	EDM_GGATCGGG	0.73158205	32	EDM_ACGATTCT	0.66256285
8	EDM_AACGACGT	0.72405159	33	EDM_AGGCGCTA	0.65348101
9	EDM_TAACGAGT	0.72331977	34	EDM_ATCCAACG	0.64768285
10	EDM_CTATATAA	0.72320271	35	EDM_CTCGTGTT	0.64099795
11	EDM_GTTCCGAA	0.72225255	36	EDM_ATATTGCC	0.63993442
12	EDM_GCGCTATC	0.71595263	37	EDM_CAGTCAAG	0.63803053
13	EDM_ACGAACGA	0.71322638	38	EDM_GCGAAGCG	0.63759154
14	EDM_TCGACATA	0.69361341	39	EDM_TCCTGTGG	0.63716823
15	EDM_ACCTCGCC	0.69169796	40	EDM_ACTCTCTC	0.63659197
16	EDM_CACCGGAT	0.69112372	41	EDM_GGCGATCA	0.63594854
17	EDM_CGTATCGG	0.69073415	42	EDM_CCCCCCTG	0.63546211
18	EDM_GGGTTGCA	0.69031698	43	EDM_TCGTGCCA	0.63408917
19	EDM_GACCGGCG	0.68690753	44	EDM_TCCCTACT	0.63252074
20	EDM_GTACGTCC	0.68384075	45	EDM_GCCGTGAC	0.63248158
21	EDM_GGTGGACA	0.68310016	46	EDM_CGTCGCTG	0.63233876
22	EDM_GGCGCGAG	0.67621911	47	EDM_GATTCGCT	0.63176686
23	EDM_AGGTTCTC	0.67594147	48	EDM_CAATGCCC	0.63152003
24	EDM_CGGGTATA	0.67563164	49	EDM_TTAGTCGT	0.63035995
25	EDM_GAGGTATT	0.67491972	50	EDM_CAAATCCT	0.63023627
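
The EDM features in Table 8 are likewise 8-base sequences, here taken from the 5' end of the cfDNA fragment itself (m = 8) rather than from the reference genome. The following is a minimal sketch of how the per-sample proportions might be computed, assuming each fragment is available as a read sequence written 5' to 3'. The function name, the input format, and the decision to skip motifs containing ambiguous bases are assumptions, not details taken from the invention.

```python
from collections import Counter


def end_motif_features(fragment_seqs, m=8):
    """Return the proportion of each m-base 5'-end motif across all fragments.

    fragment_seqs is an iterable of read sequences (5' to 3'). Reads shorter
    than m bases, or whose first m bases include a non-ACGT character, are
    skipped (assumed filtering rule).
    """
    counts = Counter()
    for seq in fragment_seqs:
        motif = seq[:m].upper()
        if len(motif) == m and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {f"EDM_{motif}": n / total for motif, n in counts.items()}
```

The returned keys match the naming style of Table 8, so a fragment beginning with GAGTCGAT contributes to the variable EDM_GAGTCGAT.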

The 15 trained sub-models were combined into the final linear equation: ALLStacked = (CNVmodel1 + CNVmodel2 + CNVmodel3 + FSRmodel1 + FSRmodel2 + FSRmodel3 + FSDmodel1 + FSDmodel2 + FSDmodel3 + BPMmodel1 + BPMmodel2 + BPMmodel3 + EDMmodel1 + EDMmodel2 + EDMmodel3) / 15.

TABLE 9

Model	Model basic parameters
CNV_1	Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15}
CNV_2	Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15}
CNV_3	Max_depth = {15}, Min_rows = {15}, Min_child_weight = {15}
FSR_1	Max_depth = {3}, Min_rows = {5}, Min_child_weight = {5}
FSR_2	epsilon = 1e-6, rho = {0.9}, hidden = {100}
FSR_3	Max_depth = {12}, Min_rows = {3}, Min_child_weight = {3}
FSD_1	epsilon = 1e-6, rho = {0.95}, hidden = {100}
FSD_2	epsilon = 1e-6, rho = {0.9}, hidden = {50}
FSD_3	epsilon = 1e-6, rho = {0.95}, hidden = {20}
BPM_1	epsilon = 1e-6, rho = {0.9}, hidden = {50}
BPM_2	epsilon = 1e-6, rho = {0.95}, hidden = {20}
BPM_3	epsilon = 1e-6, rho = {0.9}, hidden = {50}
EDM_1	epsilon = 1e-7, rho = {0.95}, hidden = {50}
EDM_2	epsilon = 1e-6, rho = {0.95}, hidden = {50}
EDM_3	epsilon = 1e-9, rho = {0.9}, hidden = {100}

The 15 single-feature sub-models were then aggregated a second time by averaging their outputs, and the prediction performance of this secondary aggregation model improved over that of the single-feature models. The score at which the training set reaches 90% sensitivity was adopted as the cutoff value for distinguishing benign breast nodules from malignant breast tumors. On the training set, the model achieved an AUC of 91.2%, with a sensitivity of 90% and a specificity of 76.3%. On the validation set, the AUC was 89.3%, with a specificity of 85.9% and a sensitivity of 89.8%.
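
The secondary aggregation and the choice of cutoff can be illustrated with a minimal sketch. It assumes each sub-model outputs a malignancy score per sample, that labels are coded 1 for malignant and 0 for benign, and that a sample is called malignant when its stacked score is at or above the cutoff. The function names and the exact tie-handling rule are hypothetical and are not taken from the invention.

```python
import math


def stacked_score(sub_scores):
    """Average the sub-model scores for one sample (15 sub-models in the text)."""
    return sum(sub_scores) / len(sub_scores)


def cutoff_at_sensitivity(scores, labels, target=0.90):
    """Highest cutoff at which at least `target` of malignant samples score >= cutoff."""
    malignant = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not malignant:
        raise ValueError("no malignant samples in the training set")
    # Number of malignant samples that must be called positive.
    k = math.ceil(target * len(malignant) - 1e-9)
    return malignant[max(k, 1) - 1]


def sensitivity_specificity(scores, labels, cutoff):
    """Return (sensitivity, specificity) when calling score >= cutoff malignant."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)
```

In this sketch the cutoff is fixed on the training set only and then applied unchanged to the validation set, which is why the validation sensitivity need not equal exactly 90%.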

Claims

1. The application of the gene marker in preparing an early screening detection reagent for malignant breast cancer and benign breast nodules is characterized in that the gene marker comprises the following components:

a first marker: copy number in different windows on chromosomes in WGS data;

third marker: the number of reads, after cfDNA fragments are aligned to the reference genome, in different length gradient intervals on the long and short chromosome arms; the length gradient intervals are obtained by stepping through the range of 100-220 bp in 5 bp increments; the long and short arms are selected from the following chromosome arms:

fifth marker: the proportions, among all base fragments, of the different types of m-base fragments at the 5' ends of the cfDNA fragments;

the first marker is obtained through the following steps: dividing a reference genome into a plurality of windows, and respectively obtaining copy number data in the different windows on chromosomes 1-22 in the WGS data; the window size is 0.8-1.2 Mb;

the second marker is obtained through the following steps: dividing a reference genome into a plurality of windows, and respectively counting, in each window, the proportions of short cfDNA fragments and of long cfDNA fragments among all cfDNA fragments in that window;

the third marker is obtained through the following steps: aligning cfDNA fragments to a reference genome, taking the long arm and the short arm of each chromosome as regional ranges, and obtaining the number of reads in the different length gradient intervals within each range;

the fourth marker is obtained through the following steps: aligning the cfDNA fragment data to a reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of n bp upstream and n bp downstream of that position as a base fragment set; and taking the proportion of each type of base fragment among all fragments as a fourth feature set;

the fifth marker is obtained through the following steps: taking the m bases at the 5' end of each cfDNA fragment as a base fragment set, and obtaining the proportion of each type of base fragment among all fragments; n is 4 and m is 8.

2. The method for constructing the malignant breast cancer screening model is characterized in that the model is used for classifying malignant breast cancer and benign breast nodules of a sample and comprises the following steps of:

step 1, extracting cfDNA from samples of a malignant breast cancer patient and a control group and sequencing to obtain cfDNA fragmentation information;

step 4, aligning the read data to a reference genome, taking the long arm and the short arm of each chromosome as regional ranges, and obtaining the number of reads in the different length gradient intervals within each range as a third feature set; the length gradient intervals are obtained by stepping through the range of 100-220 bp in 5 bp increments; the long and short arms are selected from the following chromosome arms:

step 7, taking the first, second, third, fourth and fifth feature sets as initial feature values together, inputting the initial feature values as model feature vectors into a classification model, and training the model by taking the classification results of malignant breast cancer and benign nodules as output values to obtain a malignant breast cancer screening model;

the windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping segments with a length of 0.8-1.2 Mb;

the step 3 includes:

the number of reads in step 4 is normalized;

in the step 5, n is 4; in the step 6, m is 8;

in step 7, the first, second, third, fourth and fifth feature sets are respectively input into a generalized linear regression model, a gradient boosting model, a random forest, a deep learning neural network model and an extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model;