CN116153420B - Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model - Google Patents
- Publication number
- CN116153420B (application number CN202310446774.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- fragments
- reference genome
- cfdna
- base
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/112—Disease subtyping, staging or classification
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/158—Expression markers
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention relates to the application of gene markers in the early screening of malignant breast cancer and benign breast nodules, and to a method for constructing a screening model. Low-depth whole-genome sequencing (WGS) is performed on cfDNA from plasma samples obtained by liquid biopsy; windowed copy number variation (CNV), DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), breakpoint motif (BPM) and end motif (EDM) features are then used to build a multi-feature, multi-algorithm ensemble model by automated machine learning, thereby achieving noninvasive and accurate diagnosis of breast cancer.
Description
Technical Field
The invention relates to early screening of malignant breast cancer and benign breast nodules, and belongs to the field of molecular biomedicine.
Background
Breast cancer is the most common cancer among women worldwide and the leading cause of cancer death in women. According to the GLOBOCAN 2020 global cancer report, about 2.26 million new breast cancer cases were diagnosed in 2020, accounting for 11.7% of all new cancer cases and making breast cancer the most commonly diagnosed cancer worldwide; it seriously threatens women's physical and mental health and quality of life. Studies show that early screening detects breast cancer sooner and improves survival and quality of life. The most common screening methods at present are breast ultrasound, breast magnetic resonance imaging (breast MRI), automated breast ultrasound systems (ABUS) and mammography. Each has drawbacks: the quality of a breast ultrasound examination depends to some extent on the operator's experience, patient compliance with MRI is low, and automated breast ultrasound systems are costly. Mammography, the most widely used method, is currently the primary screening method for early breast cancer, but its sensitivity differs between breast types. In younger women mammography is less accurate; in women over 50, fibroglandular tissue is progressively replaced by adipose tissue with age, so abnormal lesions near adipose tissue are more easily detected and screening accuracy is higher. Screening sensitivity is therefore related to age. In addition, for extremely dense breasts (almost entirely dense tissue), which account for about 10% of the total, mammography suffers from overdiagnosis and low sensitivity.
Studies have shown that the AUC of detection by mammography is 0.79 and that of detection by breast ultrasound is 0.78. Because the sensitivity of imaging is limited and imaging findings are used as the basis for diagnosing breast tumors, the risk of unnecessary invasive surgery increases. There is therefore an urgent need for effective, practical and highly sensitive screening methods, suitable for a broad population, to provide auxiliary screening for people assessed as high-risk by imaging.
Disclosure of Invention
The invention provides a method in which WGS is performed on cfDNA from plasma samples and the high-throughput sequencing results are used to analyze features that differ between malignant breast cancer and benign nodules: copy number variation (CNV) in 1 Mb windows, DNA fragment size distribution (FSD), DNA fragment size ratio (FSR), breakpoint motif (Break Point Motif, BPM) and end motif (End Point Motif, EDM). Each feature is modeled separately with a generalized linear model (GLM), gradient boosting machine (GBM), random forest (RF), deep learning (DL) and extreme gradient boosting (XGBoost); the multi-feature, multi-algorithm models are then integrated by averaging their outputs to obtain a final risk coefficient used for classification, thereby achieving noninvasive and accurate diagnosis of malignant breast cancer.
Use of a genetic marker in the early screening of malignant breast cancer and benign breast nodules, said genetic marker comprising:
a first marker: copy number in different windows on chromosomes in WGS data;
a second marker: the ratio of the number of short reads and the ratio of the number of long reads, among the cfDNA fragments aligned to the reference genome, in different windows of the reference genome; the short reads are 100-150 bp long and the long reads are 151-220 bp long;
a third marker: the number of reads, among the cfDNA fragments aligned to the reference genome, in different length gradient intervals on the long and short chromosome arms; the length gradient intervals are obtained by stepping through the range 100-220 bp in increments of 4-5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_p, chr19_q; wherein the characters chr and the following digits represent the chromosome number, q represents the long arm, and p represents the short arm;
a fourth marker: for the cfDNA fragments aligned to the reference genome, the proportion of each n bp base sequence upstream and downstream of the breakpoint among all such base sequences;
a fifth marker: the proportion of each distinct m-base sequence at the 5' end of the cfDNA fragments aligned to the reference genome among all such base fragments.
The first marker is obtained through the following steps: dividing the reference genome into a plurality of windows, and obtaining copy number data in the different windows on chromosomes 1-22 from the WGS data; the window size is 0.8-1.2 Mb.
The second marker is obtained through the following steps: dividing the reference genome into a plurality of windows, and counting, for each window, the proportions of short reads and of long reads among all cfDNA fragments aligned to that window.
The third marker is obtained through the following steps: aligning the cfDNA fragments to the reference genome, taking the long and short arms of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region.
The fourth marker is obtained through the following steps: aligning the cfDNA fragment data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the n bp of sequence upstream and downstream of that position as a base fragment set; and taking the proportion of each base fragment among all fragments as the fourth feature set.
The fifth marker is obtained through the following steps: taking the m bases at the 5' end of each cfDNA fragment as a base fragment set, and obtaining the proportion of each base fragment among all fragments.
n is 4 and m is 8.
The method for constructing the malignant breast cancer screening model is used for classifying malignant breast cancer and benign breast nodules of a sample and comprises the following steps of:
step 1, extracting cfDNA from samples of a malignant breast cancer patient and a control group (benign nodule patient) and sequencing to obtain cfDNA fragmentation information;
step 2, dividing a reference genome into a plurality of windows, and respectively obtaining copy number data in the range of each window as a first characteristic value;
step 3, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the short read ratio and the long read ratio within each window as a second feature set, wherein the short reads are 100-150 bp long and the long reads are 151-220 bp long;
step 4, aligning the read data to the reference genome, taking the long and short arms of each chromosome as regions, and obtaining the number of reads in each length gradient interval within each region as a third feature set; the length gradient intervals are obtained by stepping through the range 100-220 bp in increments of 4-5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_p, chr19_q; wherein the characters chr and the following digits represent the chromosome number, q represents the long arm, and p represents the short arm;
step 5, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the n bp of sequence upstream and downstream of that position as a base fragment set; and taking the proportion of each base fragment among all fragments as a fourth feature set;
step 6, taking the m bases at the 5' end of each read as a base fragment set, and obtaining the proportion of each base fragment among all fragments as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets as initial feature values, inputting them into a classification model as model feature vectors, and training the model with the classification of malignant breast cancer versus benign nodule as the output value to obtain a malignant breast cancer screening model.
The windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping segments 0.8-1.2 Mb in length.
The step 3 includes:
step 3-1, dividing a reference genome into a plurality of windows according to the length of 5 Mb;
step 3-2, counting, for each window, the proportions of short-read and long-read cfDNA fragments among all cfDNA fragments in that window;
the number of reads described in step 4 is normalized.
In the step 5, n is 4;
in the step 6, m is 8;
in step 7, the first, second, third, fourth and fifth feature sets are each input into the generalized linear regression model, the gradient boosting model, the random forest, the deep learning neural network model and the extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model.
In the process of obtaining the plurality of sub-models, the sub-models of the first, second, third, fourth and fifth feature sets are screened according to their classification performance, and the selected sub-models are applied to the linear relation model.
The beneficial effects of the invention are as follows: the fragment length ratio, copy number variation and fragment distribution of WGS cfDNA from 98 malignant breast cancer patients and 93 benign nodule patients were counted and analyzed, and a generalized linear regression model, gradient boosting model, random forest model, deep learning neural network model and extreme gradient boosting model were trained and integrated by automated machine learning to obtain the final model. The invention is the first to screen for malignant breast cancer based on fragmentation results from high-throughput, low-depth sequencing of plasma cfDNA. Compared with existing analysis and detection methods, the model has higher sensitivity, classifies malignant breast tumors and benign nodules more effectively, and reduces the risk of unnecessary surgery and complications.
Drawings
FIG. 1 is a schematic diagram of the model building process;
FIG. 2 is a distribution diagram of the features that differ most within each feature type;
FIG. 3 shows the AUC curves of each feature and of the final integrated model in the training set;
FIG. 4 shows the AUC curve of the final integrated model in the validation set;
FIG. 5 shows the distribution of the final model's prediction scores for benign nodules and malignant breast tumors in the validation set.
Description of the embodiments
The calculation method in the invention is detailed as follows:
The invention first requires cfDNA extraction, library construction and sequencing from a blood sample. The extraction and library construction methods are not particularly limited and may be adapted from methods in the prior art; during sequencing, the base information of the cfDNA may be obtained with existing sequencing technology. The reference genome used in the invention is hg19.
The purpose of the model in this patent is to distinguish malignant breast tumors (malignant breast cancer) from benign nodules, that is, to classify the samples. During training, patients confirmed by subsequent postoperative pathology to have benign nodules form the control group, and patients confirmed to have malignant breast cancer form the positive group.
The data set used in the model construction process of the invention is as follows:
TABLE 1
Extraction and sequencing method of plasma cfDNA sample:
A liquid biopsy was performed on each patient: a 10 ml whole blood sample was collected in a purple-top blood collection tube (EDTA anticoagulant tube), and the plasma was separated by centrifugation promptly (within 2 hours), stored frozen at -80 degrees Celsius and transferred to the laboratory for analysis. In the laboratory, ctDNA was extracted from the plasma samples with a QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. After library construction from the extracted ctDNA samples, WGS was performed at 5× depth. The sequencing output was then aligned to the human reference genome (hg19) to obtain the base information of the corresponding reads.
The model building process of the patent is mainly as follows:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain cfDNA fragmentation data;
step 2, dividing a reference genome into a plurality of windows, and respectively obtaining copy number data in the range of each window as a first characteristic;
step 3, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the position on the reference genome and the length of each cfDNA fragment; the proportions of short and long cfDNA reads among all fragments in each window are taken as a second feature;
step 4, calculating, at the level of each chromosome arm, the fragment coverage in 5 bp length steps over the range 100 bp to 220 bp as a third feature;
step 5, taking the frequency of the base combinations at the breakpoints of the DNA fragments as a fourth feature;
step 6, taking the frequency of the base combinations at the ends of the DNA fragments as a fifth feature;
step 7, inputting the model feature vectors of the positive and control samples into the first-layer models, selecting the best model for each feature, and integrating them by averaging to obtain the final model output.
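The averaging integration in step 7 can be illustrated with a minimal sketch. The function names, the example scores and the 0.5 cut-off are hypothetical and are not taken from the patent; the sketch only assumes that each selected sub-model outputs a probability of malignancy per sample.

```python
import numpy as np

def ensemble_risk_score(submodel_scores):
    """Average per-sample malignancy probabilities over several sub-models.

    submodel_scores: array-like of shape (n_models, n_samples).
    """
    scores = np.asarray(submodel_scores, dtype=float)
    return scores.mean(axis=0)

def classify(risk, threshold=0.5):
    # The threshold is a hypothetical cut-off, not a value given in the text.
    return np.where(risk >= threshold, "malignant", "benign")

# Toy scores for three samples from three hypothetical sub-models.
scores = [
    [0.9, 0.2, 0.7],  # e.g. best CNV sub-model
    [0.8, 0.1, 0.5],  # e.g. best FSR sub-model
    [0.7, 0.3, 0.6],  # e.g. best FSD sub-model
]
risk = ensemble_risk_score(scores)
labels = classify(risk)
```

Averaging gives every selected sub-model the same weight, which matches the "averaging integration" wording; a weighted combination would need weights that the text does not specify.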
The characteristic values of the patent are five, and are respectively described in detail as follows:
1. 1 Mb window copy number variation (1 Mb-Bin Copy Number Variation, CNV)
Copy number changes are highly correlated with individual cancers. Although cancers can already be distinguished by detecting copy number changes in certain cancer-associated genes or specific genomic intervals, other rare or unknown genes or intervals may also carry information on potential copy number changes.
The copy number data are collected as follows. First, WGS data from 30 healthy individuals are collected and chromosomes 1-22 of the reference genome are divided into non-overlapping 1 Mb windows. For each sample, the read depth in each window is calculated with bedtools coverage and corrected for the GC content and average mappability (UCSC bigWig file) of the window. The median depth of the 30 healthy individuals in each window is taken as representative, giving a population baseline of read depths for 2475 windows. For each test sample, read depths for the same 2475 windows are obtained, and a hidden Markov model (Hidden Markov Model, HMM) is used together with the population baseline depth of each window to compute the log copy number change of each window, i.e. log2(corrected and normalized depth of the test sample / corrected and normalized depth of the population baseline), giving the copy number change information of each test sample.
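The log2 ratio against the healthy baseline can be sketched as below. This is a simplified illustration only: GC and mappability correction and the HMM step are omitted, normalization by the median depth is an assumption, and the function name and pseudocount are hypothetical.

```python
import numpy as np

def log2_copy_ratio(sample_depth, baseline_depth, pseudocount=1e-9):
    """Per-window log2(sample / baseline) after scaling each profile by its median."""
    sample = np.asarray(sample_depth, dtype=float)
    baseline = np.asarray(baseline_depth, dtype=float)
    sample_norm = sample / np.median(sample)
    baseline_norm = baseline / np.median(baseline)
    # The pseudocount avoids log2(0) for empty windows.
    return np.log2((sample_norm + pseudocount) / (baseline_norm + pseudocount))

# Toy example with 5 windows; the profile in the text has 2475 windows.
healthy = np.array([[100, 110, 90, 105, 95],
                    [98, 108, 92, 100, 97],
                    [102, 112, 88, 103, 99]])
baseline = np.median(healthy, axis=0)          # median depth per window
sample = np.array([200, 220, 180, 420, 190])   # fourth window has roughly doubled depth
ratios = log2_copy_ratio(sample, baseline)
```

In this toy profile the first three windows give a ratio near 0 and the fourth a ratio near 1, which is the pattern expected for a single-window gain.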
2. cfDNA fragment size ratio (Fragment Size Ratio, FSR)
The DNA fragment size ratio reflects the proportions of cfDNA reads of different lengths. A predictive model that distinguishes malignant breast cancer from benign breast nodules is built by machine learning on the fragment size ratio. Comparing the lengths of cfDNA reads from malignant breast cancer and benign breast nodules shows that the numbers of fragments of 100-150 bp, 151-220 bp and 100-220 bp are distributed differently along the chromosomes, which can be used as a distinguishing feature.
cfDNA read length data are obtained as follows. The aligned BAM file records the quality, length and alignment position of each read; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb, and the total number of reads (100-220 bp), the number of short reads (100-150 bp) and the number of long reads (151-220 bp) in each window are counted. Each read count is then standardized using the counts across all windows, i.e. standardized value = (raw value - mean) / standard deviation. This gives 1082 sets (541 × 2 = 1082) of read counts for the two length classes (short and long). In the feature data set, the proportion of reads in each length range within each window is calculated from these counts, i.e. proportion = number of cfDNA reads in the length range / number of all cfDNA reads in the window.
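The per-window ratios and the standardization formula above can be sketched as follows. The sketch assumes that fragment lengths and window indices have already been extracted from the BAM file; the function names and toy data are hypothetical.

```python
import numpy as np

SHORT = (100, 150)  # short fragments, bp (inclusive), as stated in the text
LONG = (151, 220)   # long fragments, bp (inclusive)

def fragment_size_ratios(window_ids, lengths, n_windows):
    """Return (short_ratio, long_ratio), one value per window."""
    window_ids = np.asarray(window_ids)
    lengths = np.asarray(lengths)
    keep = (lengths >= SHORT[0]) & (lengths <= LONG[1])
    is_short = keep & (lengths <= SHORT[1])
    is_long = keep & (lengths >= LONG[0])
    total = np.bincount(window_ids[keep], minlength=n_windows).astype(float)
    short = np.bincount(window_ids[is_short], minlength=n_windows)
    long_ = np.bincount(window_ids[is_long], minlength=n_windows)
    # Windows without fragments get a ratio of 0 instead of a division by zero.
    safe_total = np.where(total > 0, total, 1.0)
    return short / safe_total, long_ / safe_total

def zscore(values):
    """standardized value = (raw value - mean) / standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

window_ids = [0, 0, 0, 0, 1, 1, 2]
lengths = [120, 140, 160, 300, 110, 200, 170]  # the 300 bp fragment is out of range
short_ratio, long_ratio = fragment_size_ratios(window_ids, lengths, n_windows=3)
short_z = zscore(short_ratio)
```

With 541 windows, concatenating the short and long vectors yields the 1082 values per sample mentioned in the text.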
3. cfDNA fragment size distribution (Fragment Size Distribution, FSD)
Building on the DNA fragment size ratio, the 39 long and short chromosome arms of the human reference genome are used as windows to obtain higher-resolution read statistics. The windows are as follows:
TABLE 2
chr1_p | chr4_q | chr8_p | chr11_q | chr16_q | chr20_p |
chr1_q | chr5_p | chr8_q | chr12_p | chr17_p | chr20_q |
chr2_p | chr5_q | chr9_p | chr12_q | chr17_q | chr21_q |
chr2_q | chr6_p | chr9_q | chr13_q | chr18_p | chr22_q |
chr3_p | chr6_q | chr10_p | chr14_q | chr18_q | |
chr3_q | chr7_p | chr10_q | chr15_q | chr19_p | |
chr4_p | chr7_q | chr11_p | chr16_p | chr19_q | |
Fragments of 100-220 bp are divided into 24 length gradients in 5 bp steps (for example, 100-104 bp, 105-109 bp and so on, on the q arm of chr1). The number of fragments in each length gradient within each chromosome arm window is counted and standardized, giving a total of 936 features (39 arms × 24 length gradients = 936) that describe the high-resolution DNA fragment size distribution.
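A minimal sketch of the 39 × 24 feature layout follows. It assumes each fragment has already been assigned an arm index (0-38) and a length; placing 220 bp in the last bin and standardizing over the whole matrix are assumptions, since the text does not state either detail.

```python
import numpy as np

BIN_EDGES = np.arange(100, 221, 5)  # 25 edges -> 24 bins: 100-104, 105-109, ..., 215-220

def fsd_counts(arm_ids, lengths, n_arms=39):
    """Count fragments per (chromosome arm, length gradient)."""
    arm_ids = np.asarray(arm_ids)
    lengths = np.asarray(lengths)
    counts = np.zeros((n_arms, len(BIN_EDGES) - 1))
    for arm in range(n_arms):
        # np.histogram closes the last bin on the right, so 220 bp is included.
        counts[arm], _ = np.histogram(lengths[arm_ids == arm], bins=BIN_EDGES)
    return counts

def standardize(counts):
    # Assumption: one global mean and standard deviation per sample.
    return (counts - counts.mean()) / counts.std()

arm_ids = [0, 0, 1, 38]
lengths = [100, 104, 220, 99]  # the 99 bp fragment falls outside all bins
counts = fsd_counts(arm_ids, lengths)
features = standardize(counts).ravel()
```

Flattening the matrix gives the 936-element feature vector described above.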
4. cfDNA breakpoint motif (BreakPoint Motif, BPM)
Human genomic DNA is a double helix held together by hydrogen bonds between complementary bases. During normal aging and cancer progression, the pH of the cellular environment changes, disrupting these hydrogen bonds and causing breaks; after the DNA enters the bloodstream, it undergoes non-random fragmentation. This process may be related to tissue of origin, disease state, nucleosome accessibility and endonuclease activity. Because breakage is related to the base sequence at the break, the proportions of the sequences found at different breakpoints also differ. The collection method is as follows: the aligned BAM file records the base information and alignment position of each read; for the reference genome coordinate of the 5' end of each read, the 4 bp sequences on either side of the breakpoint are determined, and the number of reads with each 8 bp sequence (4^8 = 65536 in total) at the breakpoint is counted. The read proportion of each of the 65536 breakpoint sequences is then calculated, e.g. proportion of AAAAAAAA = number of reads with AAAAAAAA / total number of reads over all breakpoint sequences.
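The breakpoint motif counting can be sketched as below. The sketch assumes a reference sequence held as a Python string and 0-based coordinates of the first base of each read; strand handling is omitted, and the function name and toy data are hypothetical.

```python
from collections import Counter

def breakpoint_motifs(reference, five_prime_positions, flank=4):
    """Proportion of each 2*flank-mer spanning a 5' breakpoint on the reference."""
    counts = Counter()
    for pos in five_prime_positions:
        start, end = pos - flank, pos + flank
        if start < 0 or end > len(reference):
            continue  # skip breakpoints too close to the sequence ends
        motif = reference[start:end].upper()
        if set(motif) <= set("ACGT"):  # ignore motifs containing N
            counts[motif] += 1
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}

reference = "ACGTACGTACGTACGT"
props = breakpoint_motifs(reference, [4, 8, 6, 1])  # position 1 is skipped
```

Unlike the end motif, this feature reads the reference on both sides of the breakpoint, so half of each motif lies outside the sequenced fragment.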
5. cfDNA end motif (End Motif, EDM)
After alignment, the 8 bp sequence at the 5' end of each read is obtained, and the number of reads with each end sequence (4^8 = 65536 in total) is counted. The read proportion of each of the 65536 end sequences is then calculated, e.g. proportion of AAAAAAAA = number of reads with AAAAAAAA / total number of reads over all end sequences.
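As a minimal sketch, the end motif proportions can be computed directly from read sequences. The function name and toy reads are hypothetical, and reads shorter than 8 bases or containing N are simply skipped, which is an assumption.

```python
from collections import Counter
from itertools import product

def end_motif_proportions(read_sequences, k=8):
    """Proportion of each possible 5' k-mer among all valid reads (4**k entries)."""
    counts = Counter()
    for seq in read_sequences:
        motif = seq[:k].upper()
        if len(motif) == k and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    all_motifs = ("".join(p) for p in product("ACGT", repeat=k))
    return {m: (counts[m] / total if total else 0.0) for m in all_motifs}

reads = ["AAAAAAAACC", "AAAAAAAAGG", "ACGTACGTAC", "CCCCCCCCTT", "ACG"]
props = end_motif_proportions(reads)
```

Returning all 65536 motifs, including those with zero count, keeps the feature vector the same length for every sample.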
Through the above data acquisition, initial data vectors for the five feature types are obtained. The algorithms used are described next:
1. Generalized linear regression model (Generalized Linear Model, GLM)
The generalized linear regression model is a common machine learning algorithm. It extends the ordinary linear regression model to overcome its shortcomings, in particular its inability to handle discrete dependent variables. It introduces a link function that connects the linear predictor to the value of the dependent variable y, acting as a bridge between the linear and nonlinear parts.
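As an illustration of the link function, and assuming a binary response with the logit link (the text does not name the family or link actually used), the relation between the linear predictor and the probability of malignancy can be written as:

```latex
\eta = \beta_0 + \sum_{j=1}^{p} \beta_j x_j, \qquad
g(\mu) = \log\frac{\mu}{1-\mu} = \eta, \qquad
\mu = P(y = 1 \mid x) = \frac{1}{1 + e^{-\eta}}
```

Here x_j are the feature values, beta_j the fitted coefficients, eta the linear predictor and g the link function; the symbols are introduced for this illustration only.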
2. Gradient boosting model (Gradient Boosting Machine, GBM)
In each iteration, the gradient boosting algorithm first computes the negative gradient of the loss of the current model on all samples, then trains a new weak classifier to fit these values, computes the weight of the weak classifier, and finally updates the model.
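The iteration described above corresponds to the standard gradient boosting update, shown here in generic notation (the symbols are introduced for illustration and do not appear in the patent):

```latex
r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad
h_m = \arg\min_{h} \sum_{i=1}^{N} \bigl(r_{im} - h(x_i)\bigr)^2

\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\bigl(y_i, F_{m-1}(x_i) + \rho\, h_m(x_i)\bigr), \qquad
F_m(x) = F_{m-1}(x) + \rho_m h_m(x)
```

Here L is the loss function, F_{m-1} the current model, r_{im} the negative gradient for sample i, h_m the new weak learner and rho_m its weight.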
3. Random forest (Random Forest, RF)
Random forest is a powerful classification and regression tool suited to high-dimensional data with multicollinearity. Given a data set, a random forest repeatedly draws random subsets of the data to grow a set of decision trees for classification or regression, choosing a splitting attribute at each node and repeating until the trees can no longer be split; the results of all trees are then combined to give the final prediction.
4. Deep learning neural network model (Deep Learning Neural Network)
A neural network consists of inputs, weights, biases (or thresholds) and outputs. A node is activated when its output exceeds a specified threshold, and its data are passed to the next layer of the network. Each node of the input layer is connected to each node of the hidden layer, where a weighted sum followed by an activation function is applied; the output layer is computed from the hidden-layer values in the same way. The method offers high classification accuracy, strong parallel distributed processing, and strong distributed storage and learning capability.
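The weighted-sum-and-activation computation can be sketched with a single hidden layer. The ReLU and sigmoid activations, the layer sizes and the weights below are illustrative assumptions and are not specified in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: weighted sum plus bias, then activation, at each layer."""
    hidden = np.maximum(0.0, x @ w_hidden + b_hidden)  # ReLU activation (assumed)
    return sigmoid(hidden @ w_out + b_out)             # probability-like output

x = np.array([[1.0, 2.0]])                      # one sample, two features
w_hidden = np.array([[1.0, -1.0], [0.5, 1.0]])  # 2 inputs -> 2 hidden nodes
b_hidden = np.zeros(2)
w_out = np.array([[1.0], [-2.0]])               # 2 hidden nodes -> 1 output
b_out = np.zeros(1)
prob = forward(x, w_hidden, b_hidden, w_out, b_out)
```

In this toy case the hidden values are 2 and 1, the output pre-activation is 0, and the sigmoid maps it to 0.5.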
5. Extreme gradient boosting (eXtreme Gradient Boosting, XGBoost)
XGBoost is an optimized additive model based on the ensemble idea of the gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT). It expands the loss function with a second-order Taylor expansion to optimize it and improve accuracy, uses a regularization term to simplify the model and avoid overfitting, and adopts a block storage structure that allows parallel computation.
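The second-order Taylor expansion and the regularization term can be written in the commonly used XGBoost notation (the symbols are introduced here for illustration):

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{N} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Bigr] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^{2}

g_i = \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^{2} l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^{2}}
```

Here f_t is the tree added at iteration t, T its number of leaves, w its leaf weights, and gamma and lambda the regularization strengths.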
In addition, the patent uses random hyperparameter search (Random Grid Search Parameters) to optimize the models. Random search is a common method of hyperparameter optimization in machine learning: parameter values are drawn at random from a specified range for each model parameter, and the best combination among the sampled values is selected. Instead of trying all possible combinations, a fixed number of random combinations is evaluated, each taking one random value per hyperparameter. Compared with manual tuning and grid search, random search can achieve better results with fewer trials and is more efficient, particularly when there are many parameters.
In the optimization and parameter adjustment process of the model, the super parameters of five algorithms used in the patent are shown in the following table:
TABLE 3
Algorithm | Hyperparameters |
Generalized linear regression model (GLM) | alpha {0.0, 0.2, 0.4, 0.6, 0.8, 1.0} |
Extreme gradient boosting model (XGBoost) | max_depth {3, 4, 5, 6, 7, 8, 9, 10, 15, 20}; min_rows {0.01, 0.1, 1.0, 3.0, 5.0, 10.0, 15.0, 20.0}; min_child_weight {3, 5, 10, 15, 20} |
Random forest (Random Forest) | max_depth {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}; min_rows {1, 5, 10, 15, 30, 100}; ntrees: 10000 |
Deep learning neural network model (Deep Learning) | epsilon {1e-6, 1e-7, 1e-8, 1e-9}; hidden {20}, {50}, {100}; rho {0.9, 0.95, 0.99} |
Gradient boosting model (GBM) | max_depth {3, 4, 5, 6, 7, 8, 9, 10}; min_rows {1, 5, 10, 15, 30, 100}; nbins {10, 20, 40, 60} |
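The random search over such a grid can be sketched as follows. The XGBoost value lists are copied from Table 3; the function names, the number of iterations, and in particular `toy_score` are hypothetical placeholders, since the real score would come from training a model and measuring its AUC.

```python
import random

# XGBoost search space as listed in Table 3.
XGB_GRID = {
    "max_depth": [3, 4, 5, 6, 7, 8, 9, 10, 15, 20],
    "min_rows": [0.01, 0.1, 1.0, 3.0, 5.0, 10.0, 15.0, 20.0],
    "min_child_weight": [3, 5, 10, 15, 20],
}


def random_search(grid, score_fn, n_iter, seed=0):
    """Evaluate n_iter random combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # One random value per hyperparameter forms one candidate combination.
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder objective (assumption); stands in for a training-set AUC."""
    return -abs(params["max_depth"] - 9) - abs(params["min_rows"] - 15.0)


best_params, best_score = random_search(XGB_GRID, toy_score, n_iter=20)
```

The full grid above has 10 x 8 x 5 = 400 combinations, so 20 random draws evaluate only 5% of it, which is the efficiency argument made for random search.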
After the five types of initial data were obtained for 98 malignant breast cancer cases and 93 benign breast nodule cases, the copy number variation (CNV) statistics were used as input values, and malignant breast tumor samples and benign breast nodule samples were classified with each of the five classification models. During screening, random hyperparameter search was used to vary the parameters and structure of each of the five models, yielding sub-models trained on the data; the three best sub-models for this feature were then selected, using the training-set AUC as the index of classification performance. Similarly, the cfDNA fragment size ratio (FSR), cfDNA fragment size distribution (FSD), breakpoint sequences (BPM), and end sequences (EDM) of malignant breast tumors and benign breast nodules were each used as input values, and three optimal sub-models were selected for each feature by the same optimization procedure, giving 3 × 5 = 15 models in total. In each calculation, the contribution value of each feature variable to the classification result can be obtained. The three optimal models selected for each feature (15 models in total) are listed in Table 9.
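The selection of the three best sub-models per feature by training-set AUC can be sketched as below. The rank-based AUC is the standard definition; the sub-model names and toy scores are hypothetical, and the patent does not describe how ties between sub-models were handled.

```python
def auc(labels, scores):
    """Rank-based AUC: probability that a malignant sample (label 1)
    scores higher than a benign one (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_k_submodels(candidates, labels, k=3):
    """Return the names of the k candidate sub-models with the highest AUC.

    candidates maps a sub-model name to its predicted scores on the training set.
    """
    ranked = sorted(candidates, key=lambda name: auc(labels, candidates[name]), reverse=True)
    return ranked[:k]


# Toy training set: two malignant and two benign samples.
train_labels = [1, 1, 0, 0]
cnv_candidates = {
    "xgb_a": [0.9, 0.8, 0.2, 0.1],   # separates the classes perfectly
    "xgb_b": [0.9, 0.2, 0.8, 0.1],   # one misordered pair
    "glm_a": [0.1, 0.2, 0.8, 0.9],   # inverted ranking
    "rf_a": [0.5, 0.5, 0.5, 0.5],    # uninformative
}
selected = top_k_submodels(cnv_candidates, train_labels)
```

Repeating this for each of the five feature types yields the 3 × 5 = 15 sub-models referred to in the text.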
The top-ranked feature variables of the optimal model selected for each feature, ordered by contribution value, and their contribution values are as follows:
Copy number variation (CNV) extreme gradient boosting (XGBoost) model:
TABLE 4
Rank | Variable | Contribution value | Rank | Variable | Contribution value |
1 | Cnv.22.46000001.47000000 | 1 | 21 | Cnv.3.50000001.51000000 | 0.192185063 |
2 | Cnv.4.176000001.177000000 | 0.707718729 | 22 | Cnv.4.135000001.136000000 | 0.187070758 |
3 | Cnv.4.103000001.104000000 | 0.702344457 | 23 | Cnv.12.97000001.98000000 | 0.159003193 |
4 | Cnv.6.132000001.133000000 | 0.603610479 | 24 | Cnv.7.82000001.83000000 | 0.158473368 |
5 | Cnv.22.48000001.49000000 | 0.584821318 | 25 | Cnv.10.126000001.127000000 | 0.153133441 |
6 | Cnv.3.101000001.102000000 | 0.51322448 | 26 | Cnv.8.33000001.34000000 | 0.152661605 |
7 | Cnv.3.153000001.154000000 | 0.497560161 | 27 | Cnv.22.29000001.30000000 | 0.131411155 |
8 | Cnv.13.75000001.76000000 | 0.480732668 | 28 | Cnv.5.122000001.123000000 | 0.128729099 |
9 | Cnv.12.76000001.77000000 | 0.353319757 | 29 | Cnv.6.3000001.4000000 | 0.128214895 |
10 | Cnv.9.134000001.135000000 | 0.344604821 | 30 | Cnv.1.241000001.242000000 | 0.1216052 |
11 | Cnv.2.129000001.130000000 | 0.329638899 | 31 | Cnv.12.82000001.83000000 | 0.118804964 |
12 | Cnv.18.34000001.35000000 | 0.267532225 | 32 | Cnv.13.105000001.106000000 | 0.114712761 |
13 | Cnv.8.110000001.111000000 | 0.26669307 | 33 | Cnv.7.5000001.6000000 | 0.114293264 |
14 | Cnv.3.80000001.81000000 | 0.256606013 | 34 | Cnv.8.10000001.11000000 | 0.105616978 |
15 | Cnv.16.56000001.57000000 | 0.253101489 | 35 | Cnv.3.189000001.190000000 | 0.105581721 |
16 | Cnv.3.21000001.22000000 | 0.232077932 | 36 | Cnv.11.97000001.98000000 | 0.102099504 |
17 | Cnv.16.50000001.51000000 | 0.22288311 | 37 | Cnv.9.107000001.108000000 | 0.09950168 |
18 | Cnv.3.41000001.42000000 | 0.211229986 | 38 | Cnv.19.34000001.35000000 | 0.0989202 |
19 | Cnv.6.81000001.82000000 | 0.202457945 | 39 | Cnv.3.52000001.53000000 | 0.089593942 |
20 | Cnv.15.62000001.63000000 | 0.201399251 | 40 | Cnv.18.35000001.36000000 | 0.085212451 |
cfDNA fragment size ratio (FSR) extreme gradient boosting (XGBoost) model:
TABLE 5
Rank | Variable | Contribution value | Rank | Variable | Contribution value |
1 | Frag.longA408 | 1 | 26 | Frag.longA60 | 0.09980968 |
2 | Frag.shortA64 | 0.85819829 | 27 | Frag.shortA251 | 0.09972169 |
3 | Frag.longA46 | 0.64443052 | 28 | Frag.longA535 | 0.0921953 |
4 | Frag.longA102 | 0.63926766 | 29 | Frag.longA523 | 0.0898759 |
5 | Frag.longA223 | 0.42440395 | 30 | Frag.longA237 | 0.08636216 |
6 | Frag.longA316 | 0.29105056 | 31 | Frag.longA44 | 0.08317273 |
7 | Frag.longA30 | 0.25139644 | 32 | Frag.shortA227 | 0.08238297 |
8 | Frag.longA101 | 0.24885936 | 33 | Frag.longA492 | 0.07812515 |
9 | Frag.shortA346 | 0.24481562 | 34 | Frag.longA71 | 0.07647141 |
10 | Frag.longA248 | 0.23781 | 35 | Frag.longA257 | 0.0744764 |
11 | Frag.longA32 | 0.19572478 | 36 | Frag.longA389 | 0.07397167 |
12 | Frag.shortA511 | 0.19031787 | 37 | Frag.shortA360 | 0.07290724 |
13 | Frag.longA163 | 0.16107737 | 38 | Frag.longA430 | 0.06958323 |
14 | Frag.shortA310 | 0.15044681 | 39 | Frag.shortA87 | 0.06900854 |
15 | Frag.shortA146 | 0.13785492 | 40 | Frag.shortA312 | 0.06695638 |
16 | Frag.shortA491 | 0.1351144 | 41 | Frag.longA108 | 0.06349707 |
17 | Frag.longA185 | 0.1294817 | 42 | Frag.shortA389 | 0.06096496 |
18 | Frag.longA130 | 0.12876509 | 43 | Frag.shortA35 | 0.05931402 |
19 | Frag.shortA408 | 0.12708398 | 44 | Frag.shortA61 | 0.05915703 |
20 | Frag.shortA332 | 0.12464014 | 45 | Frag.shortA393 | 0.05727665 |
21 | Frag.shortA253 | 0.12012323 | 46 | Frag.shortA353 | 0.05415344 |
22 | Frag.longA245 | 0.11298206 | 47 | Frag.longA195 | 0.0530392 |
23 | Frag.longA219 | 0.10198909 | 48 | Frag.shortA63 | 0.05250434 |
24 | Frag.shortA196 | 0.10106005 | 49 | Frag.longA517 | 0.0520624 |
25 | Frag.longA208 | 0.0999192 | 50 | Frag.shortA361 | 0.05163735 |
cfDNA Fragment Size Distribution (FSD) deep learning neural network regression model (DeepLearning, NN):
TABLE 6
Rank | Variable | Contribution value | Rank | Variable | Contribution value |
1 | FragArm.chr19.19p.frag.200.204 | 1 | 26 | FragArm.chr22.22q.frag.215.219 | 0.62650544 |
2 | FragArm.chr19.19q.frag.205.209 | 0.93421996 | 27 | FragArm.chr7.7q.frag.170.174 | 0.62340617 |
3 | FragArm.chr17.17q.frag.170.174 | 0.84517437 | 28 | FragArm.chr3.3p.frag.170.174 | 0.621714 |
4 | FragArm.chr11.11q.frag.170.174 | 0.74109721 | 29 | FragArm.chr21.21q.frag.215.219 | 0.61096567 |
5 | FragArm.chr8.8p.frag.215.219 | 0.72974157 | 30 | FragArm.chr20.20p.frag.210.214 | 0.60664117 |
6 | FragArm.chr18.18q.frag.170.174 | 0.72518045 | 31 | FragArm.chr7.7p.frag.170.174 | 0.6036374 |
7 | FragArm.chr4.4q.frag.170.174 | 0.71589434 | 32 | FragArm.chr9.9q.frag.215.219 | 0.60263228 |
8 | FragArm.chr22.22q.frag.170.174 | 0.71454889 | 33 | FragArm.chr9.9q.frag.170.174 | 0.59463716 |
9 | FragArm.chr8.8q.frag.170.174 | 0.71383041 | 34 | FragArm.chr19.19p.frag.205.209 | 0.58072054 |
10 | FragArm.chr15.15q.frag.170.174 | 0.70367897 | 35 | FragArm.chr17.17p.frag.200.204 | 0.57559198 |
11 | FragArm.chr6.6p.frag.170.174 | 0.70319629 | 36 | FragArm.chr16.16q.frag.215.219 | 0.57427329 |
12 | FragArm.chr18.18p.frag.175.179 | 0.69913715 | 37 | FragArm.chr2.2p.frag.170.174 | 0.57368487 |
13 | FragArm.chr20.20p.frag.175.179 | 0.68919247 | 38 | FragArm.chr13.13q.frag.170.174 | 0.57236755 |
14 | FragArm.chr19.19q.frag.170.174 | 0.68403781 | 39 | FragArm.chr20.20p.frag.205.209 | 0.57026112 |
15 | FragArm.chr19.19q.frag.210.214 | 0.67318714 | 40 | FragArm.chr1.1q.frag.170.174 | 0.56910765 |
16 | FragArm.chr9.9p.frag.215.219 | 0.67183381 | 41 | FragArm.chr10.10p.frag.170.174 | 0.56614232 |
17 | FragArm.chr12.12p.frag.170.174 | 0.64841783 | 42 | FragArm.chr14.14q.frag.170.174 | 0.56131285 |
18 | FragArm.chr1.1p.frag.170.174 | 0.63953185 | 43 | FragArm.chr8.8p.frag.175.179 | 0.5551706 |
19 | FragArm.chr20.20q.frag.215.219 | 0.6361028 | 44 | FragArm.chr5.5p.frag.175.179 | 0.55327171 |
20 | FragArm.chr12.12p.frag.215.219 | 0.63554609 | 45 | FragArm.chr19.19q.frag.175.179 | 0.55095059 |
21 | FragArm.chr6.6q.frag.170.174 | 0.63494736 | 46 | FragArm.chr12.12q.frag.170.174 | 0.55088931 |
22 | FragArm.chr17.17p.frag.170.174 | 0.63375968 | 47 | FragArm.chr10.10q.frag.170.174 | 0.54725403 |
23 | FragArm.chr2.2q.frag.170.174 | 0.63122767 | 48 | FragArm.chr10.10p.frag.215.219 | 0.54440355 |
24 | FragArm.chr3.3q.frag.170.174 | 0.62818843 | 49 | FragArm.chr18.18p.frag.195.199 | 0.53650242 |
25 | FragArm.chr5.5p.frag.215.219 | 0.62724286 | 50 | FragArm.chr19.19p.frag.175.179 | 0.53099901 |
Breakpoint sequence deep learning neural network regression model (DeepLearning, NN):
TABLE 7
Rank | Variable | Contribution value | Rank | Variable | Contribution value |
1 | BPM_ACGAAGTT | 1 | 26 | BPM_AGAAGTAC | 0.66628772 |
2 | BPM_CAATTATA | 0.94607401 | 27 | BPM_TAACGCGC | 0.66143578 |
3 | BPM_AGCGGTTC | 0.89684391 | 28 | BPM_GTGCGTAA | 0.6593492 |
4 | BPM_CCGGATCT | 0.8741132 | 29 | BPM_TCGTATCT | 0.65734679 |
5 | BPM_GACTCGCG | 0.85113877 | 30 | BPM_CCGTAACA | 0.65716744 |
6 | BPM_TCCATGCA | 0.81111783 | 31 | BPM_AAAAGGTC | 0.65640622 |
7 | BPM_GTGCAAAT | 0.8035053 | 32 | BPM_GCCGCGGT | 0.6516785 |
8 | BPM_TCGACGGA | 0.79976958 | 33 | BPM_ATAAGGGC | 0.64762968 |
9 | BPM_CGGCACGG | 0.78045821 | 34 | BPM_TTCGTTTA | 0.64635307 |
10 | BPM_ATCCGTAA | 0.76044983 | 35 | BPM_GCGGCCGG | 0.64323455 |
11 | BPM_GGCGTGCC | 0.75822753 | 36 | BPM_TCCGTTCT | 0.64271921 |
12 | BPM_CCGGAACG | 0.73871589 | 37 | BPM_ATGCGAAG | 0.64210856 |
13 | BPM_CAAAACTA | 0.72939718 | 38 | BPM_GCTGAGCA | 0.6320973 |
14 | BPM_TATAGTTA | 0.71431983 | 39 | BPM_TGATTATA | 0.62828749 |
15 | BPM_AGCACAAT | 0.71318734 | 40 | BPM_TACTTGCC | 0.62744111 |
16 | BPM_GTTCCGGG | 0.71131843 | 41 | BPM_AAACCCCC | 0.62479603 |
17 | BPM_GGCTTGAA | 0.70888007 | 42 | BPM_ATCCCCGT | 0.61880386 |
18 | BPM_AACGTTCG | 0.7037878 | 43 | BPM_CATAGGAA | 0.61735392 |
19 | BPM_TCGTGCGG | 0.70090812 | 44 | BPM_GTGCTCGT | 0.61727566 |
20 | BPM_AACGACCC | 0.69668245 | 45 | BPM_TCCGAAAA | 0.6157372 |
21 | BPM_CCGCGGAT | 0.69300968 | 46 | BPM_TCGGCGAT | 0.6151548 |
22 | BPM_TGTATCCT | 0.67933434 | 47 | BPM_CTCGTCCC | 0.6127463 |
23 | BPM_ATCTTTCC | 0.67813677 | 48 | BPM_TTCGGTTT | 0.61045146 |
24 | BPM_TGCGAGTC | 0.67347133 | 49 | BPM_TAAAGTTA | 0.60864198 |
25 | BPM_ACGTCTTG | 0.6698823 | 50 | BPM_TATCGCCC | 0.60788792 |
End sequence deep learning neural network regression model (DeepLearning, NN):
TABLE 8
Rank | Variable | Contribution value | Rank | Variable | Contribution value |
1 | EDM_GAGTCGAT | 1 | 26 | EDM_CGTACGCG | 0.67282635 |
2 | EDM_CAGCCGCT | 0.94303036 | 27 | EDM_CTAACGTA | 0.67080379 |
3 | EDM_AGCGTTAC | 0.89571661 | 28 | EDM_GGGATATG | 0.66796774 |
4 | EDM_GAACGTAT | 0.82257056 | 29 | EDM_TGTACCTT | 0.66630644 |
5 | EDM_CGTGCTAG | 0.78153014 | 30 | EDM_GCGATAGA | 0.66531879 |
6 | EDM_GGTGATAA | 0.73706114 | 31 | EDM_GCATTCGG | 0.6649462 |
7 | EDM_GGATCGGG | 0.73158205 | 32 | EDM_ACGATTCT | 0.66256285 |
8 | EDM_AACGACGT | 0.72405159 | 33 | EDM_AGGCGCTA | 0.65348101 |
9 | EDM_TAACGAGT | 0.72331977 | 34 | EDM_ATCCAACG | 0.64768285 |
10 | EDM_CTATATAA | 0.72320271 | 35 | EDM_CTCGTGTT | 0.64099795 |
11 | EDM_GTTCCGAA | 0.72225255 | 36 | EDM_ATATTGCC | 0.63993442 |
12 | EDM_GCGCTATC | 0.71595263 | 37 | EDM_CAGTCAAG | 0.63803053 |
13 | EDM_ACGAACGA | 0.71322638 | 38 | EDM_GCGAAGCG | 0.63759154 |
14 | EDM_TCGACATA | 0.69361341 | 39 | EDM_TCCTGTGG | 0.63716823 |
15 | EDM_ACCTCGCC | 0.69169796 | 40 | EDM_ACTCTCTC | 0.63659197 |
16 | EDM_CACCGGAT | 0.69112372 | 41 | EDM_GGCGATCA | 0.63594854 |
17 | EDM_CGTATCGG | 0.69073415 | 42 | EDM_CCCCCCTG | 0.63546211 |
18 | EDM_GGGTTGCA | 0.69031698 | 43 | EDM_TCGTGCCA | 0.63408917 |
19 | EDM_GACCGGCG | 0.68690753 | 44 | EDM_TCCCTACT | 0.63252074 |
20 | EDM_GTACGTCC | 0.68384075 | 45 | EDM_GCCGTGAC | 0.63248158 |
21 | EDM_GGTGGACA | 0.68310016 | 46 | EDM_CGTCGCTG | 0.63233876 |
22 | EDM_GGCGCGAG | 0.67621911 | 47 | EDM_GATTCGCT | 0.63176686 |
23 | EDM_AGGTTCTC | 0.67594147 | 48 | EDM_CAATGCCC | 0.63152003 |
24 | EDM_CGGGTATA | 0.67563164 | 49 | EDM_TTAGTCGT | 0.63035995 |
25 | EDM_GAGGTATT | 0.67491972 | 50 | EDM_CAAATCCT | 0.63023627 |
The 15 trained models were combined into the final linear equation: ALLStacked = (CNVmodel1 + CNVmodel2 + CNVmodel3 + FSRmodel1 + FSRmodel2 + FSRmodel3 + FSDmodel1 + FSDmodel2 + FSDmodel3 + BPMmodel1 + BPMmodel2 + BPMmodel3 + EDMmodel1 + EDMmodel2 + EDMmodel3) / 15.
TABLE 9
Model | Model basic parameters |
CNV_1 | Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15} |
CNV_2 | Max_depth = {9}, Min_rows = {15}, Min_child_weight = {15} |
CNV_3 | Max_depth = {15}, Min_rows = {15}, Min_child_weight = {15} |
FSR_1 | Max_depth = {3}, Min_rows = {5}, Min_child_weight = {5} |
FSR_2 | epsilon = 1e-6, rho = {0.9}, hidden = {100} |
FSR_3 | Max_depth = {12}, Min_rows = {3}, Min_child_weight = {3} |
FSD_1 | epsilon = 1e-6, rho = {0.95}, hidden = {100} |
FSD_2 | epsilon = 1e-6, rho = {0.9}, hidden = {50} |
FSD_3 | epsilon = 1e-6, rho = {0.95}, hidden = {20} |
BPM_1 | epsilon = 1e-6, rho = {0.9}, hidden = {50} |
BPM_2 | epsilon = 1e-6, rho = {0.95}, hidden = {20} |
BPM_3 | epsilon = 1e-6, rho = {0.9}, hidden = {50} |
EDM_1 | epsilon = 1e-7, rho = {0.95}, hidden = {50} |
EDM_2 | epsilon = 1e-6, rho = {0.95}, hidden = {50} |
EDM_3 | epsilon = 1e-9, rho = {0.9}, hidden = {100} |
The 15 single-feature sub-models were aggregated a second time by taking the mean of the sub-model outputs. The prediction performance of the aggregated model is better than that of the single-feature models. The cutoff value for distinguishing benign breast nodules from malignant breast tumors was set at the score giving 90% sensitivity in the training set. The final training-set AUC reached 91.2%, with a sensitivity of 90% and a specificity of 76.3%. In the validation set, the AUC was 89.3%, the specificity 85.9%, and the sensitivity 89.8%.
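The averaging step and the choice of a cutoff at 90% training-set sensitivity can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, a sample is called malignant when its score is greater than or equal to the cutoff, and the toy scores are invented for the example rather than taken from the patent.

```python
import math


def stacked_score(submodel_scores):
    """Equal-weight mean of the per-sample predictions of all sub-models.

    submodel_scores holds one list of per-sample scores per sub-model.
    """
    n_models = len(submodel_scores)
    return [sum(sample) / n_models for sample in zip(*submodel_scores)]


def cutoff_at_sensitivity(labels, scores, target=0.90):
    """Largest cutoff at which at least `target` of malignant samples score >= cutoff."""
    pos = sorted((s for y, s in zip(labels, scores) if y == 1), reverse=True)
    needed = math.ceil(target * len(pos))
    return pos[needed - 1]


def sensitivity_specificity(labels, scores, cutoff):
    """Sensitivity and specificity when score >= cutoff is called malignant."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < cutoff)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


# Toy data: 10 malignant (label 1) and 5 benign (label 0) samples.
toy_labels = [1] * 10 + [0] * 5
toy_scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.05,
              0.1, 0.15, 0.25, 0.05, 0.0]
toy_cutoff = cutoff_at_sensitivity(toy_labels, toy_scores)
sens, spec = sensitivity_specificity(toy_labels, toy_scores, toy_cutoff)
```

The cutoff would be fixed on the training set and then applied unchanged to the validation set, which is why the validation sensitivity need not be exactly 90%.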
Claims (2)
1. Use of a gene marker in the preparation of an early-screening detection reagent for malignant breast cancer and benign breast nodules, characterized in that the gene marker comprises:
a first marker: the copy number in different windows on the chromosomes in WGS data;
a second marker: the short-read count ratio and the long-read count ratio in different windows of the reference genome after the cfDNA fragments are aligned to the reference genome; the base length of a short read is 100-150 bp, and the base length of a long read is 151-220 bp;
a third marker: the number of reads in different length gradient intervals on the long and short arms of the reference genome after the cfDNA fragments are aligned to the reference genome; the length gradient intervals are obtained by dividing the range of 100-220 bp in steps of 5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_q, chr19_q; wherein the characters chr and the following digits denote the chromosome number, q denotes the long arm, and p denotes the short arm;
a fourth marker: the proportion, among all base sequences, of each base sequence of n bp upstream and n bp downstream of the breakpoint on the reference genome after the cfDNA fragments are aligned to the reference genome;
a fifth marker: the proportion, among all base fragments, of each type of m-base fragment at the 5' end of the cfDNA fragments;
the first marker is obtained through the following steps: dividing the reference genome into a plurality of windows, and obtaining the copy number data in the different windows on chromosomes 1-22 in the WGS data; the window size is 0.8-1.2 Mb;
the second marker is obtained through the following steps: dividing the reference genome into a plurality of windows, and counting, in each window, the proportions of short-read cfDNA fragments and of long-read cfDNA fragments among all cfDNA fragments in that window;
the third marker is obtained through the following steps: aligning the cfDNA fragments to the reference genome, taking the long arm and the short arm of each chromosome as regions, and obtaining the number of reads in the different length gradient intervals within each region;
the fourth marker is obtained through the following steps: aligning the cfDNA fragment data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of n bp upstream and n bp downstream of that position as a base fragment set; and taking the proportion of each type of base fragment among all fragments as a fourth feature set;
the fifth marker is obtained through the following steps: taking the m bases at the 5' end of each cfDNA fragment as a base fragment set, and obtaining the proportion of each type of base fragment among all fragments; n is 4 and m is 8.
2. A method for constructing a malignant breast cancer screening model, characterized in that the model is used for classifying samples as malignant breast cancer or benign breast nodules, and the method comprises the following steps:
step 1, extracting cfDNA from samples of malignant breast cancer patients and of a control group and sequencing it to obtain cfDNA fragmentation information;
step 2, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a first feature value;
step 3, aligning the read data to the reference genome, dividing the reference genome into a plurality of windows, and obtaining the short-read count ratio and the long-read count ratio within each window, wherein the base length of a short read is 100-150 bp and the base length of a long read is 151-220 bp;
step 4, aligning the read data to the reference genome, taking the long arm and the short arm of each chromosome as regions, and obtaining the number of reads in the different length gradient intervals within each region as a third feature set; the length gradient intervals are obtained by dividing the range of 100-220 bp in steps of 5 bp; the long and short arms are selected from the following chromosome arms:
chr1_p, chr4_q, chr8_p, chr11_q, chr16_q, chr20_p, chr1_q, chr5_p, chr8_q, chr12_p, chr17_p, chr20_q, chr2_p, chr5_q, chr9_p, chr12_q, chr17_q, chr21_q, chr2_q, chr6_p, chr9_q, chr13_q, chr18_p, chr22_q, chr3_p, chr6_q, chr10_p, chr14_q, chr18_q, chr3_q, chr7_p, chr10_q, chr15_q, chr19_p, chr4_p, chr7_q, chr11_p, chr16_q, chr19_q; wherein the characters chr and the following digits denote the chromosome number, q denotes the long arm, and p denotes the short arm;
step 5, aligning the read data to the reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of n bp upstream and n bp downstream of that position as a base fragment set; and taking the proportion of each type of base fragment among all fragments as a fourth feature set;
step 6, taking the m bases at the 5' end of each read as a base fragment set, and obtaining the proportion of each type of base fragment among all fragments as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into a classification model as model feature vectors, and training the model with the classification results of malignant breast cancer and benign nodules as output values to obtain the malignant breast cancer screening model;
the windows in step 2 are obtained by dividing chromosomes 1-22 of the reference genome into non-overlapping segments with a length of 0.8-1.2 Mb;
step 3 comprises:
step 3-1, dividing the reference genome into a plurality of windows with a length of 5 Mb;
step 3-2, counting, in each window, the proportions of short-read cfDNA fragments and of long-read cfDNA fragments among all cfDNA fragments in that window;
the number of reads in step 4 is normalized;
in step 5, n is 4; in step 6, m is 8;
in step 7, the first, second, third, fourth and fifth feature sets are each input into a generalized linear regression model, a gradient boosting model, a random forest, a deep learning neural network model and an extreme gradient boosting model to obtain a plurality of sub-models, and the sub-models are combined into a linear relation model;
in the process of obtaining the plurality of sub-models, the sub-models of each of the first, second, third, fourth and fifth feature sets are screened according to their classification performance, and the sub-models retained by the screening are applied to the linear relation model.
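As an illustration of steps 3 and 6 above, the sketch below computes the short-read and long-read proportions per 5 Mb window and the frequencies of 8-base 5' end motifs. It is not the patented implementation: the input format (tuples of chromosome, alignment start, and fragment length, plus a list of read sequences) and the function names are assumptions, and the denominator is taken to be all fragments falling in the window.

```python
from collections import Counter, defaultdict

SHORT = range(100, 151)   # short reads: 100-150 bp
LONG = range(151, 221)    # long reads: 151-220 bp
WINDOW = 5_000_000        # 5 Mb windows, as in step 3-1


def fragment_size_ratios(fragments, window=WINDOW):
    """Map (chrom, window_index) -> (short_ratio, long_ratio).

    fragments is an iterable of (chrom, start, length) tuples.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # short, long, total
    for chrom, start, length in fragments:
        c = counts[(chrom, start // window)]
        c[2] += 1
        if length in SHORT:
            c[0] += 1
        elif length in LONG:
            c[1] += 1
    return {key: (s / t, l / t) for key, (s, l, t) in counts.items()}


def end_motif_frequencies(sequences, m=8):
    """Proportion of each m-base 5' end motif among all reads of length >= m."""
    motifs = Counter(seq[:m].upper() for seq in sequences if len(seq) >= m)
    total = sum(motifs.values())
    return {motif: n / total for motif, n in motifs.items()}


# Toy input (hypothetical coordinates and sequences).
toy_fragments = [("chr1", 100, 120), ("chr1", 200, 160),
                 ("chr1", 300, 180), ("chr1", 5_000_100, 140)]
fsr = fragment_size_ratios(toy_fragments)
edm = end_motif_frequencies(["ACGTACGTAA", "ACGTACGTCC", "TTTTTTTTGG", "ACG"])
```

Each resulting value would become one column of the feature vector, in the manner of the `Frag.short…`, `Frag.long…` and `EDM_…` variables listed in Tables 5 and 8.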
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310446774.6A CN116153420B (en) | 2023-04-24 | 2023-04-24 | Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116153420A (en) | 2023-05-23 |
CN116153420B (en) | 2023-08-18 |
Family
ID=86356536
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310446774.6A Active CN116153420B (en) | 2023-04-24 | 2023-04-24 | Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116153420B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116403637A (en) * | 2023-06-08 | 2023-07-07 | 深圳市睿法生物科技有限公司 | Model construction method of liver cirrhosis marker |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111863250A (en) * | 2020-08-14 | 2020-10-30 | 中国科学院大学温州研究院(温州生物材料与工程研究所) | Combined diagnosis model and system for early breast cancer |
CN111910004A (en) * | 2020-08-14 | 2020-11-10 | 中国科学院大学温州研究院(温州生物材料与工程研究所) | Application of cfDNA in noninvasive diagnosis of early breast cancer |
US10993653B1 (en) * | 2018-07-13 | 2021-05-04 | Johnson Thomas | Machine learning based non-invasive diagnosis of thyroid disease |
CN114927213A (en) * | 2022-04-15 | 2022-08-19 | 南京世和基因生物技术股份有限公司 | Construction method and detection device of multiple-cancer early screening model |
Also Published As
Publication number | Publication date |
---|---|
CN116153420A (en) | 2023-05-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||