CN116153420B - Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model - Google Patents


Publication number
CN116153420B
Authority
CN
China
Prior art keywords
model
fragments
reference genome
cfdna
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310446774.6A
Other languages
Chinese (zh)
Other versions
CN116153420A (en)
Inventor
邵阳
吴雪
包华
刘睿
吴舒雨
唐皖湘夫
唐诗婷
刘思思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Original Assignee
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shihe Medical Devices Co ltd, Nanjing Shihe Gene Biotechnology Co ltd
Priority to CN202310446774.6A
Publication of CN116153420A
Application granted
Publication of CN116153420B
Legal status: Active
Anticipated expiration

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20: Supervised data analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/112: Disease subtyping, staging or classification
    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00: Oligonucleotides characterized by their use
    • C12Q2600/158: Expression markers
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention relates to the use of gene markers in the early screening of malignant breast cancer and benign breast nodules, and to a method for constructing a screening model. Low-depth whole-genome sequencing (WGS) is performed on cfDNA from plasma samples obtained by liquid biopsy. Five feature types are extracted: window copy number variation (CNV), DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), DNA breakpoint motif (BPM) and DNA end motif (EDM). Automated machine learning is then used to construct a multi-feature, multi-algorithm ensemble model, enabling noninvasive and accurate diagnosis of breast cancer.

Description

Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model
Technical Field
The invention relates to early screening of malignant breast cancer and benign breast nodules, and belongs to the field of molecular biomedicine.
Background
Breast cancer is the most common cancer among women worldwide and the most common cause of cancer death in women. According to the GLOBOCAN 2020 global cancer report, there were 2.26 million new breast cancer cases in 2020, accounting for 11.7% of all new cancer cases and making breast cancer the most commonly diagnosed cancer worldwide. It seriously threatens the physical and mental health of women and affects their quality of life. Research shows that early screening detects breast cancer sooner and improves survival and quality of life. Currently, the most common screening methods for breast cancer are breast ultrasound, breast magnetic resonance imaging (Breast MRI), automated breast ultrasound systems (Automated Breast Ultrasound System, ABUS) and mammography (Mammogram). However, each of these techniques has drawbacks: the quality of a breast ultrasound examination depends to some extent on the experience of the operator, patient compliance with MRI is low, and automated breast ultrasound systems are costly. Mammography, the most widely used, is currently the primary screening method for early breast cancer, but its detection sensitivity differs between breast types. For example, mammography has lower screening accuracy in younger women. In women over 50 years old, because fibroglandular tissue is progressively replaced by adipose tissue with age, abnormal lesions near adipose tissue are more easily detected by mammography and screening accuracy is higher. Screening sensitivity is therefore related to age. In addition, for extremely dense breasts (almost entirely dense tissue), which account for about 10% of the total, mammography suffers from overdiagnosis and low sensitivity.
Studies have shown that the AUC of mammography-based detection is 0.79 and that of breast ultrasound is 0.78. The sensitivity of imaging-based detection of breast cancer is therefore limited, and relying on imaging as the basis for diagnosing breast tumors increases the risk of unnecessary invasive surgery. There is thus an urgent need for an effective, practical and highly sensitive screening method, suitable for a wide population, to provide auxiliary screening for people identified as high-risk by imaging.
Disclosure of Invention
The invention provides a method in which WGS is performed on cfDNA from plasma samples and, from the high-throughput sequencing results, five features that differ between malignant breast cancer and benign nodules are analyzed: copy number variation (CNV) in 1 Mb windows, DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), breakpoint motif (Break Point Motif, BPM) and end motif (End Point Motif, EDM). Each feature is modeled separately with a generalized linear model (GLM), a gradient boosting machine (GBM), a random forest (RF), deep learning (DL) and extreme gradient boosting (XGBoost). The multi-feature, multi-algorithm results are finally integrated by averaging to obtain a final risk score, which is used for classification, thereby achieving noninvasive and accurate diagnosis of malignant breast cancer.
Use of a genetic marker in the early screening of malignant breast cancer and benign breast nodules, said genetic marker comprising:
a first marker: copy number in different windows on chromosomes in WGS data;
a second marker: the proportions of short reads and long reads among the cfDNA fragments aligned to different windows of the reference genome; short reads are 100-150 bp long and long reads are 151-220 bp long;
a third marker: the number of reads in different length gradient intervals among the cfDNA fragments aligned to the long and short arms of the reference genome; the length gradient intervals are obtained by dividing the range of 100-220 bp in steps of 4-5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_p, chr19_q; wherein chr and the following digits denote the chromosome number, q denotes the long arm and p denotes the short arm;
a fourth marker: for the cfDNA fragments aligned to the reference genome, the proportion of each n-bp base sequence upstream and downstream of the breakpoint among all such base sequences;
a fifth marker: for the cfDNA fragments aligned to the reference genome, the proportion of each distinct m-base sequence at the 5' end among all such base sequences.
The first marker is obtained as follows: the reference genome is divided into a plurality of windows, and copy number data are obtained for each window on chromosomes 1-22 in the WGS data; the window size is 0.8-1.2 Mb.
The second marker is obtained as follows: the reference genome is divided into a plurality of windows, and for each window the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window are counted.
The third marker is obtained as follows: cfDNA fragments are aligned to the reference genome, the long and short arms of each chromosome are taken as regions, and the number of reads in each length gradient interval is obtained for each region.
The fourth marker is obtained as follows: the cfDNA fragment data are aligned to the reference genome to obtain the position of the 5' end of each read on the reference genome; the n bp of sequence upstream and the n bp downstream of that position are taken as the set of base fragments; the proportion of each distinct base fragment among all fragments constitutes the fourth feature set.
The fifth marker is obtained as follows: the m bases at the 5' end of each cfDNA fragment are taken as the set of base fragments, and the proportion of each distinct base fragment among all fragments is obtained.
n is 4 and m is 8.
A method for constructing a malignant breast cancer screening model, used to classify samples as malignant breast cancer or benign breast nodule, comprises the following steps:
step 1, extracting cfDNA from samples of malignant breast cancer patients and a control group (benign nodule patients) and sequencing it to obtain cfDNA fragmentation information;
step 2, dividing the reference genome into a plurality of windows and obtaining the copy number data within each window as a first feature set;
step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the proportion of short reads and the proportion of long reads within each window as a second feature set, wherein short reads are 100-150 bp long and long reads are 151-220 bp long;
step 4, aligning the read data to the reference genome, taking the long and short arms of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region as a third feature set; the length gradient intervals are obtained by dividing the range of 100-220 bp in steps of 4-5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_p, chr19_q; wherein chr and the following digits denote the chromosome number, q denotes the long arm and p denotes the short arm;
step 5, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; taking the n bp of sequence upstream and the n bp downstream of that position as the set of base fragments; and taking the proportion of each distinct base fragment among all fragments as a fourth feature set;
step 6, taking the m bases at the 5' end of each read as the set of base fragments, and obtaining the proportion of each distinct base fragment among all fragments as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets as initial feature values, inputting them as model feature vectors into a classification model, and training the model with the classification of malignant breast cancer versus benign nodule as the output to obtain the malignant breast cancer screening model.
The windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping segments of 0.8-1.2 Mb.
Step 3 comprises:
step 3-1, dividing the reference genome into a plurality of windows of 5 Mb;
step 3-2, for each window, counting the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window.
The number of reads in step 4 is normalized.
In step 5, n is 4.
In step 6, m is 8.
In step 7, the first, second, third, fourth and fifth feature sets are each input into the generalized linear model, the gradient boosting model, the random forest, the deep learning neural network model and the extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model.
When the plurality of sub-models is obtained, the sub-models of the first, second, third, fourth and fifth feature sets are screened according to their classification performance, and the sub-models that pass screening are used in the linear relation model.
The beneficial effects of the invention are as follows: the fragment length ratio, copy number change and fragment distribution of WGS cfDNA from 98 malignant breast cancer patients and 93 benign nodule patients were counted and analyzed, and a generalized linear model, a gradient boosting model, a random forest model, a deep learning neural network model and an extreme gradient boosting model were trained and integrated through automated machine learning to obtain the final model. The invention is the first to screen for malignant breast cancer based on the fragmentation results of high-throughput low-depth sequencing of plasma cfDNA. Compared with existing analysis and detection methods, the model has higher sensitivity, classifies malignant breast tumors and benign nodules more effectively, and reduces the risk of unnecessary surgery and complications.
Drawings
FIG. 1 is a schematic diagram of a model building process;
FIG. 2 shows the distribution of the most variable feature of each feature type;
FIG. 3 shows the AUC curves of the individual features and of the final integrated model in the training set;
FIG. 4 shows the AUC curve of the final integrated model in the validation set;
FIG. 5 shows the distribution of the final model's prediction scores for benign nodules and malignant breast tumors in the validation set.
Description of the embodiments
The calculation method in the invention is detailed as follows:
The invention first requires cfDNA extraction from a blood sample, library construction, sequencing and related steps. The extraction and library construction methods are not particularly limited and may be adapted from prior-art methods; during sequencing, the base information of the cfDNA can be obtained with prior-art sequencing technology. The reference genome used in the invention is hg19.
The purpose of the model in this patent is to distinguish malignant breast tumors (malignant breast cancer) from benign nodules and to classify samples accordingly. In training, patients determined to have benign nodules by subsequent postoperative pathology form the control group, and patients determined to have malignant breast cancer form the positive group.
The data set used in the model construction process of the invention is as follows:
TABLE 1
Extraction and sequencing method of plasma cfDNA sample:
Liquid biopsy was performed on each patient: a 10 ml whole blood sample was collected in a purple-top blood collection tube (EDTA anticoagulant tube), the plasma was separated by centrifugation promptly (within 2 hours), stored frozen at -80 degrees Celsius and transferred to the laboratory for analysis. In the laboratory, cfDNA was extracted from the plasma samples with a QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. After library construction from the extracted cfDNA, WGS was performed at 5x depth. The raw sequencing data were then aligned to the human reference genome (hg19) to obtain the base information of the corresponding reads.
The model building process of the patent is mainly as follows:
step 1, extracting cfDNA from samples of the positive group and the control group and sequencing it to obtain cfDNA fragmentation data;
step 2, dividing the reference genome into a plurality of windows and obtaining the copy number data within each window as the first feature;
step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the position of each cfDNA fragment on the reference genome and its length; the proportions of short and long cfDNA reads among all fragments in each window are used as the second feature;
step 4, for each chromosome arm, calculating the fragment coverage in 5 bp length intervals over the range of 100 bp to 220 bp as the third feature;
step 5, using the frequency of each base combination at the breakpoints of the DNA fragments as the fourth feature;
step 6, using the frequency of each base combination at the ends of the DNA fragments as the fifth feature;
step 7, inputting the model feature vectors of the positive and control samples into the first-layer models, selecting the best model for each feature, and integrating them by averaging to obtain the final model output.
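The selection and averaging in step 7 can be illustrated with a minimal Python sketch. The data layout (a dictionary mapping each feature type to a list of `(training_auc, predicted_probability)` pairs), the function name `ensemble_score` and the `top_k` parameter are assumptions made for illustration; the patent states only that the best sub-models per feature are selected and their outputs averaged.

```python
def ensemble_score(models_by_feature, top_k=1):
    """Average the predictions of the top_k sub-models of every feature type.

    models_by_feature: {feature_name: [(training_auc, predicted_probability), ...]}
    The layout and the top_k parameter are illustrative assumptions.
    """
    selected = []
    for models in models_by_feature.values():
        # Rank the sub-models of this feature by training AUC, best first.
        ranked = sorted(models, key=lambda m: m[0], reverse=True)
        selected.extend(prob for _, prob in ranked[:top_k])
    # Plain arithmetic mean of the selected sub-model outputs.
    return sum(selected) / len(selected)


# Hypothetical sub-model outputs for one sample.
example = {
    "CNV": [(0.80, 0.9), (0.70, 0.1)],
    "FSR": [(0.60, 0.2), (0.90, 0.5)],
}
risk = ensemble_score(example, top_k=1)
```

The returned value plays the role of the final risk score; any decision threshold applied to it would be a further assumption, since none is given in this section.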
The patent uses five feature types, each described in detail below:
1. 1 Mb window copy number variation (1 Mb-Bin Copy Number Variation, CNV)
Copy number changes are highly correlated with various cancers. Although cancers can be distinguished by detecting copy number changes in certain cancer-associated genes or specific genomic intervals, other rare or unknown genes or intervals may also provide information on potential copy number changes.
The copy number data are collected as follows. First, WGS data from 30 healthy individuals are collected, and chromosomes 1-22 of the reference genome are divided into non-overlapping 1 Mb windows. For each sample, the read depth in each window is calculated with bedtools coverage and corrected for the GC content and average mappability of the window (UCSC bigWig file). The median depth of the 30 healthy individuals in each window is taken as representative, giving a population baseline of read depths for 2475 windows. For each test sample, the read depths of the same 2475 windows are likewise obtained, and a hidden Markov model (Hidden Markov Model, HMM), together with the population baseline depth of each window, is used to construct the log copy number change of each window, that is, log2(corrected and normalized depth of the test sample / corrected and normalized depth of the population baseline), which gives the copy number change information of each test sample.
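The baseline and log2 ratio described above can be sketched in Python as follows. This is a simplified illustration only: it assumes the per-window depths have already been corrected for GC content and mappability and normalized, it omits the HMM step entirely, and the function names and the small `pseudocount` guarding against empty windows are assumptions not found in the patent.

```python
import math
from statistics import median


def population_baseline(healthy_depths):
    """Per-window median depth across healthy samples (one row per sample)."""
    n_windows = len(healthy_depths[0])
    return [median(sample[w] for sample in healthy_depths) for w in range(n_windows)]


def log2_ratios(sample_depths, baseline, pseudocount=1e-9):
    """log2(sample depth / baseline depth) for every window.

    The pseudocount is an illustrative assumption to avoid division by zero.
    """
    return [
        math.log2((d + pseudocount) / (b + pseudocount))
        for d, b in zip(sample_depths, baseline)
    ]


# Toy data: 3 healthy samples and 3 windows (the patent uses 30 samples, 2475 windows).
baseline = population_baseline([[10, 20, 30], [12, 18, 30], [11, 22, 30]])
ratios = log2_ratios([22, 20, 15], baseline)
```

In this toy example the first window has twice the baseline depth, the second matches it and the third has half, so the ratios are close to 1, 0 and -1.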
2. cfDNA fragment size ratio (Fragment Size Ratio, FSR)
The DNA fragment size ratio reflects the proportions of cfDNA reads of different lengths. A predictive model is built by machine learning on the fragment size ratio to distinguish malignant breast cancer from benign breast nodules. Comparison of the cfDNA read lengths of malignant breast cancer and benign breast nodules showed that the distribution of the number of fragments of 100-150 bp, 151-220 bp and 100-220 bp across the chromosomes differs between the two, and this can be used as a distinguishing feature.
cfDNA read length data are obtained as follows. The aligned BAM file records the quality, length and alignment position of each read; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb, and the total number of reads (100-220 bp), the number of short reads (100-150 bp) and the number of long reads (151-220 bp) in each window are counted. Each read count is then standardized against the counts of the same type across all windows, that is, standardized value = (original value - mean) / standard deviation. This yields 1082 (541 x 2 = 1082) read-count features for short and long reads. In the feature data set, the proportion of reads in each length range in each window is calculated from the read counts as: number of cfDNA reads in the length range / number of all cfDNA reads in the window.
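The per-window counting and the standardization formula can be sketched as follows. The input layout (tuples of chromosome, start coordinate and fragment length), the function names, and the use of the population standard deviation are assumptions for illustration; the patent does not say which standard deviation estimator is used, and real input would come from a BAM file rather than a Python list.

```python
from statistics import mean, pstdev

WINDOW = 5_000_000  # 5 Mb windows, as described in the text


def fragment_size_ratios(fragments, window=WINDOW):
    """fragments: iterable of (chrom, start, length).

    Returns {(chrom, window_index): (short_ratio, long_ratio)}, where short is
    100-150 bp, long is 151-220 bp, and the denominator is all 100-220 bp reads.
    """
    counts = {}
    for chrom, start, length in fragments:
        if not 100 <= length <= 220:
            continue  # fragments outside 100-220 bp are ignored
        key = (chrom, start // window)
        short, long_, total = counts.get(key, (0, 0, 0))
        if length <= 150:
            short += 1
        else:
            long_ += 1
        counts[key] = (short, long_, total + 1)
    return {k: (s / t, l / t) for k, (s, l, t) in counts.items()}


def zscore(values):
    """Standardized value = (original value - mean) / standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population SD: an assumption, not stated in the source
    return [(v - mu) / sd if sd else 0.0 for v in values]


toy = [("chr1", 100, 120), ("chr1", 200, 160), ("chr1", 300, 140),
       ("chr1", 5_000_100, 200), ("chr1", 400, 90)]
ratios = fragment_size_ratios(toy)
```

Here the first window of chr1 holds two short fragments and one long fragment, and the 90 bp fragment is discarded because it falls outside the 100-220 bp range.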
3. cfDNA fragment size distribution (Fragment Size Distribution, FSD)
Building on the DNA fragment size ratio, the long and short arms of the chromosomes of the human reference genome, 39 regions in total, are used as windows to obtain higher-resolution read statistics. The windows are as follows:
TABLE 2
chr1_p chr4_q chr8_p chr11_q chr16_q chr20_p
chr1_q chr5_p chr8_q chr12_p chr17_p chr20_q
chr2_p chr5_q chr9_p chr12_q chr17_q chr21_q
chr2_q chr6_p chr9_q chr13_q chr18_p chr22_q
chr3_p chr6_q chr10_p chr14_q chr18_q
chr3_q chr7_p chr10_q chr15_q chr19_p
chr4_p chr7_q chr11_p chr16_p chr19_q
Fragments of 100-220 bp are divided into 24 length gradients in steps of 5 bp (for example, 100-104 bp, 105-109 bp, ... on the q arm of chr1). The number of fragments in each length gradient in each arm window is counted and standardized, giving a high-resolution DNA fragment size distribution of 936 features in total (936 = 39 arms x 24 length gradients).
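The binning into 24 length gradients per chromosome arm can be sketched as below. The function names and the input layout (pairs of arm label and fragment length) are illustrative assumptions. One further assumption is labeled in the code: 5 bp steps from 100 bp give exactly 24 bins covering 100-219 bp, and the text does not say how a 220 bp fragment is handled, so this sketch folds it into the last bin.

```python
from collections import Counter

N_BINS = 24


def length_bin(length, lo=100, hi=220, step=5, n_bins=N_BINS):
    """Index of the 5 bp length gradient (0 = 100-104 bp, 1 = 105-109 bp, ...)."""
    if not lo <= length <= hi:
        return None
    # Assumption: a 220 bp fragment is folded into the last bin (215-219 bp).
    return min((length - lo) // step, n_bins - 1)


def fsd_vector(fragments, arms):
    """fragments: iterable of (arm, length); arms: ordered list of arm labels.

    Returns raw counts ordered arm by arm, bin by bin (len(arms) * 24 values).
    Standardization of the counts is left out of this sketch.
    """
    counts = Counter()
    for arm, length in fragments:
        b = length_bin(length)
        if arm in arms and b is not None:
            counts[(arm, b)] += 1
    return [counts[(arm, b)] for arm in arms for b in range(N_BINS)]


arms = ["chr1_p", "chr1_q"]  # the patent uses the 39 arms listed in Table 2
toy = [("chr1_q", 100), ("chr1_q", 104), ("chr1_q", 105),
       ("chr1_p", 220), ("chr1_p", 99), ("chrX_p", 150)]
vector = fsd_vector(toy, arms)
```

With all 39 arms the vector has 39 x 24 = 936 entries, matching the feature count given in the text.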
4. cfDNA breakpoint motif (BreakPoint Motif, BPM)
Genomic DNA has a double-stranded structure in which the two strands are linked by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the cellular environment changes, so that the hydrogen bonds between complementary bases are disrupted and breaks occur; after the DNA enters the blood circulation, it undergoes non-random fragmentation. This process may be related to tissue of origin, disease state, nucleosome accessibility and endonuclease activity. Because fragmentation depends on the base sequence at the break, the proportions of the sequences found at different breakpoints also differ. The collection method is as follows: the aligned BAM file records the base information and alignment position of each read; for the reference genome coordinate of the 5' end of each read, the 4 bp sequences upstream and downstream of the breakpoint are determined; the number of reads with each 8 bp breakpoint sequence (4^8 = 65536 in total) is counted, and the read proportion of each of the 65536 breakpoint sequences is calculated, for example, AAAAAAAA read proportion = number of reads with AAAAAAAA / total number of reads over all breakpoint sequences.
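The breakpoint motif extraction can be sketched in Python as follows. This is a simplified illustration under several assumptions: the reference is passed as an in-memory string, positions are 0-based, only forward-strand reads are handled (a reverse-strand read would need its 5' end taken from the other coordinate and the motif reverse-complemented), and motifs containing bases other than A, C, G and T are skipped. The function names are hypothetical.

```python
from collections import Counter


def breakpoint_motif(reference, pos, flank=4):
    """8-mer spanning a breakpoint: `flank` bases upstream of `pos` followed by
    `flank` bases starting at `pos` (0-based, forward strand only)."""
    if pos < flank or pos + flank > len(reference):
        return None  # too close to a sequence end to take both flanks
    return reference[pos - flank: pos + flank].upper()


def motif_proportions(motifs):
    """Proportion of each motif among all valid motifs."""
    counts = Counter(
        m for m in motifs if m is not None and set(m) <= set("ACGT")
    )
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


reference = "AAAACCCCGGGGTTTT"      # toy reference sequence
five_prime_ends = [4, 4, 8, 2]      # toy 5' end coordinates of four reads
motifs = [breakpoint_motif(reference, p) for p in five_prime_ends]
proportions = motif_proportions(motifs)
```

Motifs that never occur are simply absent from the returned dictionary; a full feature vector would assign them a proportion of zero across all 4^8 = 65536 possible 8-mers.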
5. cfDNA end motif (End Motif, EDM)
After alignment, the 8 bp sequence at the 5' end of each read is obtained, and the number of reads with each end sequence (4^8 = 65536 in total) is counted, so that the read proportion of each of the 65536 end sequences is calculated, for example, AAAAAAAA proportion = number of reads with AAAAAAAA / total number of reads over all end sequences.
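A minimal Python sketch of the end motif feature is given below. It assumes the read sequences are available as plain strings oriented 5' to 3', skips reads shorter than 8 bases or with bases other than A, C, G and T, and orders the 65536 motifs lexicographically over the alphabet ACGT; the function name and these conventions are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def end_motif_proportions(read_sequences, k=8):
    """Proportion of every possible k-mer among the 5' end k-mers of the reads.

    Returns a list with 4**k entries, ordered as itertools.product over "ACGT".
    """
    counts = Counter()
    for seq in read_sequences:
        motif = seq[:k].upper()
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    all_motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[m] / total if total else 0.0 for m in all_motifs]


toy_reads = ["AAAAAAAACGT", "AAAAAAAATTT", "CCCCCCCCAAA", "ACGT"]
vector = end_motif_proportions(toy_reads)
```

In the toy input the last read is too short to contribute, so two of the three counted reads start with AAAAAAAA, which is the first entry of the vector.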
Through the above data acquisition, initial data vectors for the five types of data are obtained. Next, the corresponding calculation methods are designed:
1. Generalized linear regression model (Generalized Linear Model, GLM)
The generalized linear model is a common machine learning algorithm. It generalizes the linear regression model and overcomes its inability to handle discrete dependent variables. It does so by establishing a link function, which bridges the linear predictor and the value of the dependent variable y, and thus the linear and the nonlinear.
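The role of the link function can be illustrated with a short Python sketch. The patent does not name the link it uses; the logit link shown here is an assumption, chosen because it is the usual choice for a binary outcome such as benign versus malignant. The weights and intercept in the example are made-up values.

```python
import math


def logit(p):
    """Link function: maps a probability in (0, 1) to the linear-predictor scale."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(features, weights, intercept):
    """Linear predictor passed through the inverse link."""
    eta = intercept + sum(w * x for w, x in zip(weights, features))
    return inverse_logit(eta)


# Hypothetical coefficients for two features.
p = predict_probability([0.5, -1.0], weights=[2.0, 1.0], intercept=0.0)
```

The linear predictor is unbounded, while the predicted value stays between 0 and 1, which is what allows a linear model to describe a discrete outcome.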
2. Gradient boosting model (Gradient Boosting, GBM)
In each iteration of the gradient boosting algorithm, the negative gradient of the current model's loss on all samples is first calculated; a new weak classifier is then trained to fit this value as its target, its weight is calculated, and finally the model is updated.
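The iteration described above can be sketched in plain Python for a single numeric feature. This is a toy illustration under stated assumptions, not the implementation used in the patent: the loss is taken to be the logistic loss (so the negative gradient is y minus the predicted probability), the weak learner is a one-threshold regression stump, and a fixed `learning_rate` stands in for the computed weight of each weak learner.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature: a threshold and two leaf values."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2 for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Returns a function mapping a feature value to a predicted probability."""
    stumps = []
    scores = [0.0] * len(xs)  # current model output on the log-odds scale
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss with respect to the score.
        residuals = [y - sigmoid(f) for y, f in zip(ys, scores)]
        stump = fit_stump(xs, residuals)  # weak learner fitted to the gradient
        stumps.append(stump)
        # Model update; the fixed learning rate replaces a fitted weight.
        scores = [f + learning_rate * stump(x) for f, x in zip(scores, xs)]
    return lambda x: sigmoid(sum(learning_rate * s(x) for s in stumps))


model = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

On this separable toy data every round picks the split between 3 and 4, so the predicted probability moves steadily toward 0 on the left and toward 1 on the right.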
3. Random forest (Random Forest)
Random forests are a powerful classification and regression tool for high-dimensional data with multicollinearity. Given a data set, a random forest randomly samples part of the information to grow a set of decision trees for classification or regression, choosing an attribute to split on at each node and repeating the random sampling until the trees can no longer be split; finally, the results of all trees are combined to obtain the final prediction.
4. Deep learning neural network model (Deep Learning Neural Network)
A neural network consists of inputs, weights, biases (or thresholds) and outputs. A node is activated if its output is above a specified threshold, and its data are then sent to the next layer of the network. Each node of the input layer is connected to each node of the hidden layer, where a weighted sum followed by an activation function is applied. The values computed by the hidden layer are passed to the output layer and processed in the same way. The method offers high classification accuracy, strong parallel distributed processing capability, and strong distributed storage and learning capability.
5. Extreme gradient boosting (Extreme Gradient Boosting, XGBoost)
XGBoost is an optimized algorithm based on the additive ensemble model of the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT). It uses a second-order Taylor expansion of the loss function to improve accuracy, uses a regularization term to simplify the model and avoid overfitting, and adopts a block storage structure that allows parallel computation.
In addition, this patent uses a random hyperparameter search (Random Grid Search) algorithm to optimize the models. Random search is a common method for hyperparameter optimization in machine learning. It randomly samples parameter values from a specified range for the model and selects the best combination among the sampled values. Instead of trying every possible combination, it evaluates a fixed number of random combinations, drawing one random value for each hyperparameter. Compared with manual tuning and grid search, random search can reach a better result with fewer trials and is a more efficient solution, particularly when there are many parameters.
In the optimization and tuning process, the hyperparameters of the five algorithms used in this patent are shown in the following table:
Table 3
Algorithm | Model hyperparameters
Generalized linear regression model (GLM) | alpha {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}
Extreme gradient boosting model (XGBoost) | max_depth {3, 4, 5, 6, 7, 8, 9, 10, 15, 20}; min_rows {0.01, 0.1, 1.0, 3.0, 5.0, 10.0, 15.0, 20.0}; min_child_weight {3, 5, 10, 15, 20}
Random forest (Random Forest) | max_depth {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}; min_rows {1, 5, 10, 15, 30, 100}; ntrees: 10000
Deep learning neural network model (Deep Learning) | epsilon {1e-6, 1e-7, 1e-8, 1e-9}; hidden {20}, {50}, {100}; rho {0.9, 0.95, 0.99}
Gradient boosting model (GBM) | max_depth {3, 4, 5, 6, 7, 8, 9, 10}; min_rows {1, 5, 10, 15, 30, 100}; nbins {10, 20, 40, 60}
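A minimal sketch of random search over the GBM row of Table 3 is given below. The `evaluate` callback (for example, one returning a training-set AUC), the number of iterations, and the seed are assumptions introduced for the example; the patent does not state how many combinations were sampled.

```python
import random

# Search space copied from the GBM row of Table 3.
GBM_SPACE = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10],
    "min_rows": [1, 5, 10, 15, 30, 100],
    "nbins": [10, 20, 40, 60],
}


def random_search(space, evaluate, n_iter=20, seed=0):
    """Sample n_iter random combinations and keep the one with the highest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one random value for each hyperparameter.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)   # hypothetical callback, e.g. returns an AUC
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The full grid for this row has 8 × 6 × 4 = 192 combinations, so a run with `n_iter=20` evaluates roughly a tenth of them.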
After the five types of initial feature data were obtained for 98 malignant breast cancer cases and 93 benign breast nodule cases, the copy number variation (CNV) statistics were used as input values, and malignant breast tumor samples and benign breast nodule samples were classified with each of the five classification models. During screening, random hyperparameter search was used to vary the parameters and structure of each of the five models, producing sub-models trained on the data; the three best sub-models for the feature were then selected, with the training-set AUC used as the measure of classification performance. Similarly, the cfDNA fragment size ratio (FSR), cfDNA fragment size distribution (FSD), breakpoint sequence (BPM), and end sequence (EDM) features of malignant breast tumors and benign breast nodules were each used as input values, and three optimal sub-models were selected for each feature by the same optimization procedure, giving 3 × 5 = 15 models in total. Each calculation also yields the contribution value of each feature variable to the classification result. The 3 optimal models selected for each feature (15 models in total) are shown in the following table:
For the optimal model selected for each feature, the top-ranked feature variables by contribution value, together with their contribution values, are as follows:
Copy number variation (CNV), extreme gradient boosting (XGBoost) model:
Table 4
No. Variable Contribution value No. Variable Contribution value
1 Cnv.22.46000001.47000000 1 21 Cnv.3.50000001.51000000 0.192185063
2 Cnv.4.176000001.177000000 0.707718729 22 Cnv.4.135000001.136000000 0.187070758
3 Cnv.4.103000001.104000000 0.702344457 23 Cnv.12.97000001.98000000 0.159003193
4 Cnv.6.132000001.133000000 0.603610479 24 Cnv.7.82000001.83000000 0.158473368
5 Cnv.22.48000001.49000000 0.584821318 25 Cnv.10.126000001.127000000 0.153133441
6 Cnv.3.101000001.102000000 0.51322448 26 Cnv.8.33000001.34000000 0.152661605
7 Cnv.3.153000001.154000000 0.497560161 27 Cnv.22.29000001.30000000 0.131411155
8 Cnv.13.75000001.76000000 0.480732668 28 Cnv.5.122000001.123000000 0.128729099
9 Cnv.12.76000001.77000000 0.353319757 29 Cnv.6.3000001.4000000 0.128214895
10 Cnv.9.134000001.135000000 0.344604821 30 Cnv.1.241000001.242000000 0.1216052
11 Cnv.2.129000001.130000000 0.329638899 31 Cnv.12.82000001.83000000 0.118804964
12 Cnv.18.34000001.35000000 0.267532225 32 Cnv.13.105000001.106000000 0.114712761
13 Cnv.8.110000001.111000000 0.26669307 33 Cnv.7.5000001.6000000 0.114293264
14 Cnv.3.80000001.81000000 0.256606013 34 Cnv.8.10000001.11000000 0.105616978
15 Cnv.16.56000001.57000000 0.253101489 35 Cnv.3.189000001.190000000 0.105581721
16 Cnv.3.21000001.22000000 0.232077932 36 Cnv.11.97000001.98000000 0.102099504
17 Cnv.16.50000001.51000000 0.22288311 37 Cnv.9.107000001.108000000 0.09950168
18 Cnv.3.41000001.42000000 0.211229986 38 Cnv.19.34000001.35000000 0.0989202
19 Cnv.6.81000001.82000000 0.202457945 39 Cnv.3.52000001.53000000 0.089593942
20 Cnv.15.62000001.63000000 0.201399251 40 Cnv.18.35000001.36000000 0.085212451
cfDNA fragment size ratio (FSR), extreme gradient boosting (XGBoost) model:
Table 5
No. Variable Contribution value No. Variable Contribution value
1 Frag.longA408 1 26 Frag.longA60 0.09980968
2 Frag.shortA64 0.85819829 27 Frag.shortA251 0.09972169
3 Frag.longA46 0.64443052 28 Frag.longA535 0.0921953
4 Frag.longA102 0.63926766 29 Frag.longA523 0.0898759
5 Frag.longA223 0.42440395 30 Frag.longA237 0.08636216
6 Frag.longA316 0.29105056 31 Frag.longA44 0.08317273
7 Frag.longA30 0.25139644 32 Frag.shortA227 0.08238297
8 Frag.longA101 0.24885936 33 Frag.longA492 0.07812515
9 Frag.shortA346 0.24481562 34 Frag.longA71 0.07647141
10 Frag.longA248 0.23781 35 Frag.longA257 0.0744764
11 Frag.longA32 0.19572478 36 Frag.longA389 0.07397167
12 Frag.shortA511 0.19031787 37 Frag.shortA360 0.07290724
13 Frag.longA163 0.16107737 38 Frag.longA430 0.06958323
14 Frag.shortA310 0.15044681 39 Frag.shortA87 0.06900854
15 Frag.shortA146 0.13785492 40 Frag.shortA312 0.06695638
16 Frag.shortA491 0.1351144 41 Frag.longA108 0.06349707
17 Frag.longA185 0.1294817 42 Frag.shortA389 0.06096496
18 Frag.longA130 0.12876509 43 Frag.shortA35 0.05931402
19 Frag.shortA408 0.12708398 44 Frag.shortA61 0.05915703
20 Frag.shortA332 0.12464014 45 Frag.shortA393 0.05727665
21 Frag.shortA253 0.12012323 46 Frag.shortA353 0.05415344
22 Frag.longA245 0.11298206 47 Frag.longA195 0.0530392
23 Frag.longA219 0.10198909 48 Frag.shortA63 0.05250434
24 Frag.shortA196 0.10106005 49 Frag.longA517 0.0520624
25 Frag.longA208 0.0999192 50 Frag.shortA361 0.05163735
cfDNA fragment size distribution (FSD), deep learning neural network regression model (DeepLearning, NN):
Table 6
No. Variable Contribution value No. Variable Contribution value
1 FragArm.chr19.19p.frag.200.204 1 26 FragArm.chr22.22q.frag.215.219 0.62650544
2 FragArm.chr19.19q.frag.205.209 0.93421996 27 FragArm.chr7.7q.frag.170.174 0.62340617
3 FragArm.chr17.17q.frag.170.174 0.84517437 28 FragArm.chr3.3p.frag.170.174 0.621714
4 FragArm.chr11.11q.frag.170.174 0.74109721 29 FragArm.chr21.21q.frag.215.219 0.61096567
5 FragArm.chr8.8p.frag.215.219 0.72974157 30 FragArm.chr20.20p.frag.210.214 0.60664117
6 FragArm.chr18.18q.frag.170.174 0.72518045 31 FragArm.chr7.7p.frag.170.174 0.6036374
7 FragArm.chr4.4q.frag.170.174 0.71589434 32 FragArm.chr9.9q.frag.215.219 0.60263228
8 FragArm.chr22.22q.frag.170.174 0.71454889 33 FragArm.chr9.9q.frag.170.174 0.59463716
9 FragArm.chr8.8q.frag.170.174 0.71383041 34 FragArm.chr19.19p.frag.205.209 0.58072054
10 FragArm.chr15.15q.frag.170.174 0.70367897 35 FragArm.chr17.17p.frag.200.204 0.57559198
11 FragArm.chr6.6p.frag.170.174 0.70319629 36 FragArm.chr16.16q.frag.215.219 0.57427329
12 FragArm.chr18.18p.frag.175.179 0.69913715 37 FragArm.chr2.2p.frag.170.174 0.57368487
13 FragArm.chr20.20p.frag.175.179 0.68919247 38 FragArm.chr13.13q.frag.170.174 0.57236755
14 FragArm.chr19.19q.frag.170.174 0.68403781 39 FragArm.chr20.20p.frag.205.209 0.57026112
15 FragArm.chr19.19q.frag.210.214 0.67318714 40 FragArm.chr1.1q.frag.170.174 0.56910765
16 FragArm.chr9.9p.frag.215.219 0.67183381 41 FragArm.chr10.10p.frag.170.174 0.56614232
17 FragArm.chr12.12p.frag.170.174 0.64841783 42 FragArm.chr14.14q.frag.170.174 0.56131285
18 FragArm.chr1.1p.frag.170.174 0.63953185 43 FragArm.chr8.8p.frag.175.179 0.5551706
19 FragArm.chr20.20q.frag.215.219 0.6361028 44 FragArm.chr5.5p.frag.175.179 0.55327171
20 FragArm.chr12.12p.frag.215.219 0.63554609 45 FragArm.chr19.19q.frag.175.179 0.55095059
21 FragArm.chr6.6q.frag.170.174 0.63494736 46 FragArm.chr12.12q.frag.170.174 0.55088931
22 FragArm.chr17.17p.frag.170.174 0.63375968 47 FragArm.chr10.10q.frag.170.174 0.54725403
23 FragArm.chr2.2q.frag.170.174 0.63122767 48 FragArm.chr10.10p.frag.215.219 0.54440355
24 FragArm.chr3.3q.frag.170.174 0.62818843 49 FragArm.chr18.18p.frag.195.199 0.53650242
25 FragArm.chr5.5p.frag.215.219 0.62724286 50 FragArm.chr19.19p.frag.175.179 0.53099901
Breakpoint sequence (BPM), deep learning neural network regression model (DeepLearning, NN):
Table 7
No. Variable Contribution value No. Variable Contribution value
1 BPM_ACGAAGTT 1 26 BPM_AGAAGTAC 0.66628772
2 BPM_CAATTATA 0.94607401 27 BPM_TAACGCGC 0.66143578
3 BPM_AGCGGTTC 0.89684391 28 BPM_GTGCGTAA 0.6593492
4 BPM_CCGGATCT 0.8741132 29 BPM_TCGTATCT 0.65734679
5 BPM_GACTCGCG 0.85113877 30 BPM_CCGTAACA 0.65716744
6 BPM_TCCATGCA 0.81111783 31 BPM_AAAAGGTC 0.65640622
7 BPM_GTGCAAAT 0.8035053 32 BPM_GCCGCGGT 0.6516785
8 BPM_TCGACGGA 0.79976958 33 BPM_ATAAGGGC 0.64762968
9 BPM_CGGCACGG 0.78045821 34 BPM_TTCGTTTA 0.64635307
10 BPM_ATCCGTAA 0.76044983 35 BPM_GCGGCCGG 0.64323455
11 BPM_GGCGTGCC 0.75822753 36 BPM_TCCGTTCT 0.64271921
12 BPM_CCGGAACG 0.73871589 37 BPM_ATGCGAAG 0.64210856
13 BPM_CAAAACTA 0.72939718 38 BPM_GCTGAGCA 0.6320973
14 BPM_TATAGTTA 0.71431983 39 BPM_TGATTATA 0.62828749
15 BPM_AGCACAAT 0.71318734 40 BPM_TACTTGCC 0.62744111
16 BPM_GTTCCGGG 0.71131843 41 BPM_AAACCCCC 0.62479603
17 BPM_GGCTTGAA 0.70888007 42 BPM_ATCCCCGT 0.61880386
18 BPM_AACGTTCG 0.7037878 43 BPM_CATAGGAA 0.61735392
19 BPM_TCGTGCGG 0.70090812 44 BPM_GTGCTCGT 0.61727566
20 BPM_AACGACCC 0.69668245 45 BPM_TCCGAAAA 0.6157372
21 BPM_CCGCGGAT 0.69300968 46 BPM_TCGGCGAT 0.6151548
22 BPM_TGTATCCT 0.67933434 47 BPM_CTCGTCCC 0.6127463
23 BPM_ATCTTTCC 0.67813677 48 BPM_TTCGGTTT 0.61045146
24 BPM_TGCGAGTC 0.67347133 49 BPM_TAAAGTTA 0.60864198
25 BPM_ACGTCTTG 0.6698823 50 BPM_TATCGCCC 0.60788792
End sequence (EDM), deep learning neural network regression model (DeepLearning, NN):
Table 8
No. Variable Contribution value No. Variable Contribution value
1 EDM_GAGTCGAT 1 26 EDM_CGTACGCG 0.67282635
2 EDM_CAGCCGCT 0.94303036 27 EDM_CTAACGTA 0.67080379
3 EDM_AGCGTTAC 0.89571661 28 EDM_GGGATATG 0.66796774
4 EDM_GAACGTAT 0.82257056 29 EDM_TGTACCTT 0.66630644
5 EDM_CGTGCTAG 0.78153014 30 EDM_GCGATAGA 0.66531879
6 EDM_GGTGATAA 0.73706114 31 EDM_GCATTCGG 0.6649462
7 EDM_GGATCGGG 0.73158205 32 EDM_ACGATTCT 0.66256285
8 EDM_AACGACGT 0.72405159 33 EDM_AGGCGCTA 0.65348101
9 EDM_TAACGAGT 0.72331977 34 EDM_ATCCAACG 0.64768285
10 EDM_CTATATAA 0.72320271 35 EDM_CTCGTGTT 0.64099795
11 EDM_GTTCCGAA 0.72225255 36 EDM_ATATTGCC 0.63993442
12 EDM_GCGCTATC 0.71595263 37 EDM_CAGTCAAG 0.63803053
13 EDM_ACGAACGA 0.71322638 38 EDM_GCGAAGCG 0.63759154
14 EDM_TCGACATA 0.69361341 39 EDM_TCCTGTGG 0.63716823
15 EDM_ACCTCGCC 0.69169796 40 EDM_ACTCTCTC 0.63659197
16 EDM_CACCGGAT 0.69112372 41 EDM_GGCGATCA 0.63594854
17 EDM_CGTATCGG 0.69073415 42 EDM_CCCCCCTG 0.63546211
18 EDM_GGGTTGCA 0.69031698 43 EDM_TCGTGCCA 0.63408917
19 EDM_GACCGGCG 0.68690753 44 EDM_TCCCTACT 0.63252074
20 EDM_GTACGTCC 0.68384075 45 EDM_GCCGTGAC 0.63248158
21 EDM_GGTGGACA 0.68310016 46 EDM_CGTCGCTG 0.63233876
22 EDM_GGCGCGAG 0.67621911 47 EDM_GATTCGCT 0.63176686
23 EDM_AGGTTCTC 0.67594147 48 EDM_CAATGCCC 0.63152003
24 EDM_CGGGTATA 0.67563164 49 EDM_TTAGTCGT 0.63035995
25 EDM_GAGGTATT 0.67491972 50 EDM_CAAATCCT 0.63023627
The 15 trained models were combined into the final linear equation:
ALLStacked = (CNVmodel1 + CNVmodel2 + CNVmodel3 + FSRmodel1 + FSRmodel2 + FSRmodel3 + FSDmodel1 + FSDmodel2 + FSDmodel3 + BPMmodel1 + BPMmodel2 + BPMmodel3 + EDMmodel1 + EDMmodel2 + EDMmodel3) / 15.
TABLE 9
Model Model basic parameters
CNV_1 Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15}
CNV_2 Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15}
CNV_3 Max_depth = {15}, Min_rows = {15}, Min_child_weight = {15}
FSR_1 Max_depth = {3}, Min_rows = {5}, Min_child_weight = {5}
FSR_2 epsilon = 1e-6, rho = {0.9}, hidden = {100}
FSR_3 Max_depth = {12}, Min_rows = {3}, Min_child_weight = {3}
FSD_1 epsilon = 1e-6, rho = {0.95}, hidden = {100}
FSD_2 epsilon = 1e-6, rho = {0.9}, hidden = {50}
FSD_3 epsilon = 1e-6, rho = {0.95}, hidden = {20}
BPM_1 epsilon = 1e-6, rho = {0.9}, hidden = {50}
BPM_2 epsilon = 1e-6, rho = {0.95}, hidden = {20}
BPM_3 epsilon = 1e-6, rho = {0.9}, hidden = {50}
EDM_1 epsilon = 1e-7, rho = {0.95}, hidden = {50}
EDM_2 epsilon = 1e-6, rho = {0.95}, hidden = {50}
EDM_3 epsilon = 1e-9, rho = {0.9}, hidden = {100}
The 15 single-feature sub-models were aggregated a second time by averaging their results. The prediction performance of this second-level aggregated model improved over that of the single-feature models. The cutoff value for distinguishing benign breast nodules from malignant breast tumors was set at the score giving 90% sensitivity on the training set. The final training-set AUC was 91.2%, with a sensitivity of 90% and a specificity of 76.3%. On the validation set, the AUC was 89.3%, the specificity 85.9%, and the sensitivity 89.8%.
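The averaging and cutoff selection can be sketched as follows. This is a minimal illustration under assumptions: scores are probabilities of malignancy, label 1 denotes malignant, a sample is called positive when its score is at or above the cutoff, and the function names are hypothetical.

```python
import math


def stack_mean(sub_model_scores):
    """Average per-sample scores over sub-models (rows: sub-models, columns: samples)."""
    n_models = len(sub_model_scores)
    return [sum(col) / n_models for col in zip(*sub_model_scores)]


def cutoff_at_sensitivity(scores, labels, target=0.90):
    """Largest cutoff whose sensitivity on the given samples is at least target."""
    positives = sorted((s for s, l in zip(scores, labels) if l == 1), reverse=True)
    needed = math.ceil(target * len(positives) - 1e-9)  # positives that must be detected
    return positives[needed - 1]


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called malignant."""
    tp = sum(s >= cutoff and l == 1 for s, l in zip(scores, labels))
    tn = sum(s < cutoff and l == 0 for s, l in zip(scores, labels))
    n_pos = sum(l == 1 for l in labels)
    return tp / n_pos, tn / (len(labels) - n_pos)
```

In this setup the cutoff would be chosen once on the training set and then applied unchanged to the validation set, which is why the validation sensitivity need not equal 90% exactly.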

Claims (2)

1. Use of a gene marker in the preparation of an early screening detection reagent for malignant breast cancer and benign breast nodules, characterized in that the gene marker comprises:
a first marker: the copy number in different windows on the chromosomes in WGS data;
a second marker: the short-read count ratio and the long-read count ratio in different windows of the reference genome after the cfDNA fragments are aligned to the reference genome; the short reads are 100-150 bp long and the long reads are 151-220 bp long;
a third marker: the number of reads in different length gradient intervals on the long and short arms of the reference genome after the cfDNA fragments are aligned to it; the different length gradient intervals are obtained by stepping through the range 100-220 bp in 5 bp increments; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_q, chr19_q; wherein the characters chr and the following digits denote the chromosome number, q denotes the long arm, and p denotes the short arm;
a fourth marker: the proportion, among all base sequences, of each base sequence formed by the n bp upstream and the n bp downstream of the breakpoint on the reference genome after the cfDNA fragments are aligned to it;
a fifth marker: the proportion, among all base fragments, of each type of m-base fragment at the 5' end of the cfDNA fragments;
the first marker is obtained by: dividing the reference genome into a plurality of windows and obtaining the copy number data in each window on chromosomes 1-22 in the WGS data; the window size is 0.8-1.2 Mb;
the second marker is obtained by: dividing the reference genome into a plurality of windows and counting, in each window, the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window;
the third marker is obtained by: aligning the cfDNA fragments to the reference genome, taking the long arm and short arm of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region;
the fourth marker is obtained by: aligning the cfDNA fragment data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequences of the n bp upstream and downstream of that position as a set of base fragments; and taking the proportion of each type of base fragment among all fragments as the fourth feature set;
the fifth marker is obtained by: taking the m bases at the 5' end of each cfDNA fragment as a set of base fragments and obtaining the proportion of each type of base fragment among all fragments; n is 4 and m is 8.
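For illustration only, the second and third markers could be computed from a list of fragment lengths as sketched below. The assumptions are that the denominator of each ratio is the number of all fragments in the window, and that the 5 bp bins run from 100-104 up to 215-219 (matching variable names such as frag.170.174 in Table 6). The function names are hypothetical.

```python
def fragment_size_ratios(lengths):
    """Short (100-150 bp) and long (151-220 bp) fractions among all fragments in one window."""
    total = len(lengths)
    short = sum(100 <= n <= 150 for n in lengths)
    long_ = sum(151 <= n <= 220 for n in lengths)
    return short / total, long_ / total


def fragment_size_distribution(lengths, start=100, stop=220, step=5):
    """Read counts per 5 bp length bin (100-104, 105-109, ...) for one chromosome arm."""
    bins = {(lo, lo + step - 1): 0 for lo in range(start, stop, step)}
    for n in lengths:
        if start <= n < stop:
            lo = start + (n - start) // step * step   # lower edge of the bin containing n
            bins[(lo, lo + step - 1)] += 1
    return bins
```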
2. A method for constructing a malignant breast cancer screening model, characterized in that the model is used to classify samples as malignant breast cancer or benign breast nodules, and the method comprises the following steps:
step 1, extracting cfDNA from samples of malignant breast cancer patients and a control group and sequencing it to obtain cfDNA fragmentation information;
step 2, dividing the reference genome into a plurality of windows and obtaining the copy number data within each window as a first feature value;
step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the short-read count ratio and the long-read count ratio within each window, wherein the short reads are 100-150 bp long and the long reads are 151-220 bp long;
step 4, aligning the read data to the reference genome, taking the long arm and short arm of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region as a third feature set; the different length gradient intervals are obtained by stepping through the range 100-220 bp in 5 bp increments; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_q, chr19_q; wherein the characters chr and the following digits denote the chromosome number, q denotes the long arm, and p denotes the short arm;
step 5, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequences of the n bp upstream and downstream of that position as a set of base fragments; and taking the proportion of each type of base fragment among all fragments as a fourth feature set;
step 6, taking the m bases at the 5' end of each read as a set of base fragments, and obtaining the proportion of each type of base fragment among all fragments as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them as model feature vectors into a classification model, and training the model with the classification of malignant breast cancer versus benign nodules as the output value to obtain the malignant breast cancer screening model;
the windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping windows 0.8-1.2 Mb in length;
step 3 comprises:
step 3-1, dividing the reference genome into a plurality of windows 5 Mb in length;
step 3-2, counting, in each window, the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window;
the read counts in step 4 are standardized;
in step 5, n is 4; in step 6, m is 8;
in step 7, the first, second, third, fourth and fifth feature sets are each input into a generalized linear regression model, a gradient boosting model, a random forest, a deep learning neural network model and an extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model;
in the process of obtaining the plurality of sub-models, the sub-models of the first, second, third, fourth and fifth feature sets are screened by classification performance, and the sub-models that pass screening are used in the linear relation model.
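As a hedged illustration of steps 5 and 6, the sketch below computes 5' end motif proportions (m = 8) and extracts a breakpoint motif of n = 4 bases on each side of a 5' end position. It assumes reads and the reference are available as plain strings and that positions are 0-based; both assumptions, and the function names, are introduced only for the example.

```python
from collections import Counter


def end_motif_proportions(fragments, m=8):
    """Proportion of each m-base 5' end motif among all fragments of length >= m."""
    counts = Counter(seq[:m].upper() for seq in fragments if len(seq) >= m)
    total = sum(counts.values())
    return {motif: c / total for motif, c in counts.items()}


def breakpoint_motif(reference, pos, n=4):
    """n bases upstream plus n bases downstream of a 0-based 5' end position."""
    if pos - n < 0 or pos + n > len(reference):
        return None   # too close to the sequence edge to extract a full motif
    return reference[pos - n:pos + n].upper()
```

With n = 4 the breakpoint motif is 8 bases long, which matches the 8-base variable names such as BPM_ACGAAGTT in Table 7; the proportions of these motifs can be tallied with the same counting approach used for end motifs.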
CN202310446774.6A 2023-04-24 2023-04-24 Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model Active CN116153420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310446774.6A CN116153420B (en) 2023-04-24 2023-04-24 Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model

Publications (2)

Publication Number Publication Date
CN116153420A CN116153420A (en) 2023-05-23
CN116153420B true CN116153420B (en) 2023-08-18

Family

ID=86356536

Country Status (1)

Country Link
CN (1) CN116153420B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403637A (en) * 2023-06-08 2023-07-07 深圳市睿法生物科技有限公司 Model construction method of liver cirrhosis marker

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111863250A (en) * 2020-08-14 2020-10-30 中国科学院大学温州研究院(温州生物材料与工程研究所) Combined diagnosis model and system for early breast cancer
CN111910004A (en) * 2020-08-14 2020-11-10 中国科学院大学温州研究院(温州生物材料与工程研究所) Application of cfDNA in noninvasive diagnosis of early breast cancer
US10993653B1 (en) * 2018-07-13 2021-05-04 Johnson Thomas Machine learning based non-invasive diagnosis of thyroid disease
CN114927213A (en) * 2022-04-15 2022-08-19 南京世和基因生物技术股份有限公司 Construction method and detection device of multiple-cancer early screening model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant