CN115295074A - Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device - Google Patents

Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Info

Publication number
CN115295074A
Authority
CN
China
Prior art keywords
model
windows
obtaining
samples
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211220583.XA
Other languages
Chinese (zh)
Other versions
CN115295074B (en
Inventor
邵阳
吴雪
包华
刘睿
吴舒雨
吴旻
杨珊珊
刘思思
郑丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Original Assignee
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shihe Medical Devices Co ltd, Nanjing Shihe Gene Biotechnology Co ltd filed Critical Nanjing Shihe Medical Devices Co ltd
Priority to CN202211220583.XA priority Critical patent/CN115295074B/en
Priority to CN202211453553.3A priority patent/CN116052768A/en
Publication of CN115295074A publication Critical patent/CN115295074A/en
Application granted granted Critical
Publication of CN115295074B publication Critical patent/CN115295074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10Ploidy or copy number detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention provides the use of gene markers in screening for malignant pulmonary nodules, a method for constructing a screening model, and a detection device. From high-throughput sequencing results, five feature types that differ between benign and malignant pulmonary nodules rated high-risk on imaging are analyzed: the DNA fragment size ratio, the read ratio of 5'-end breakpoint sequences, 1 Mb-window copy number variation, 16 bp tumor new short sequences, and nucleosome coverage patterns. Automated machine learning is used to construct a multi-feature, multi-algorithm ensemble model that predicts whether a pulmonary nodule rated high-risk on imaging is benign or malignant, so that noninvasive, accurate diagnosis of malignant pulmonary nodules is achieved and unnecessary resections of benign nodules are reduced.

Description

Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
Technical Field
The invention relates to early screening that distinguishes benign from malignant lung nodules rated high-risk on imaging (radiographically high-risk lung nodules), and belongs to the technical field of molecular biomedicine.
Background
Lung cancer is one of the most common cancers in the world. High-risk groups include people over 65 years of age who have one or more of the following risk factors: heavy smoking, a history of smoking, a family history, prior thoracic radiation therapy, and exposure to carcinogens. Because early-stage lung cancer lacks overt symptoms, patients are generally diagnosed in the middle or late stages (stage III or IV). However, a number of studies have shown that patients diagnosed at an early stage have a higher survival rate: patients diagnosed at stage I have a five-year survival rate 13-fold higher than patients diagnosed at stage IV. Therefore, early detection and diagnosis of lung tumors is critical to improving the survival rate of lung cancer patients.
Low-dose chest computed tomography (LDCT) detection of lung nodules is currently the most common modality for discovering lung tumors. Surgical resection of lung nodules identified by imaging can reduce lung cancer mortality by 20%-39%. However, approximately 15%-35% of lung nodules judged high-risk on the initial LDCT image are ultimately found to be benign on pathology after surgical resection. Imaging therefore has limitations: diagnosing malignant lung tumors from imaging results alone leads to unnecessary operations, exposes patients to unnecessary surgical and complication risks, and increases the burden of medical expenses. It is therefore important to determine whether lung nodules judged high-risk on imaging alone are benign or malignant.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: the prior art lacks a noninvasive means of distinguishing benign from malignant lung nodules, which leads to unnecessary operations and increases the burden on patients.
In the technical scheme of this patent, whole-genome sequencing (WGS) is performed on cfDNA from plasma samples, and fragment information is obtained from the high-throughput sequencing results. Features that differ between benign and malignant pulmonary nodules are analyzed: the DNA fragment size ratio (fragment size ratio), the read ratio of 5'-end breakpoint sequences (breakpoint motifs), 1 Mb-window copy number variation (1 Mb-bin copy number variation), 16 bp tumor new short sequences (16 bp neomers) and nucleosome coverage patterns (nucleosome coverage patterns). Automated machine learning with an elastic net logistic regression model (glm), extreme gradient boosting (xgboost), a random forest (random forest) and a neural network (neural network) is used to construct a multi-feature, multi-algorithm ensemble model, so as to achieve noninvasive, accurate diagnosis of malignant pulmonary nodules.
The specific technical scheme is as follows:
the application of the gene marker in preparing a malignant pulmonary nodule screening reagent;
the gene marker comprises:
a first marker: the ratio of the number of short reads to the number of long reads among cfDNA fragments aligned to different windows of the reference genome;
a second marker: the proportion of each kind of m-base fragment at the 5' end of cfDNA fragments aligned to the reference genome, among all such base fragments;
a third marker: copy number in different windows on chromosomes in WGS data;
a fourth marker: tumor new short sequence proportion;
a fifth marker: nucleosome coverage pattern.
The first marker is obtained by the following steps: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads within each window.
The second marker is obtained by the following steps: taking m base data of the 5' end in the reading data as a base fragment set, and obtaining the proportion of various base fragments in all the fragments.
The third marker is obtained by the following steps: the reference genome is divided into a plurality of windows, and copy number data of the WGS data in different windows on chromosomes 1-22 are obtained separately.
The fourth marker is obtained by the following steps:
generating, by exhaustive enumeration, the set A of all short sequences of 16 bp in length; enumerating the set B of all 16 bp short sequences present in the human reference genome sequence, and defining the sequences that remain after removing set B from set A as nullomers;
obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur more than once; according to the positions of these base substitutions, finding among the nullomers the set C of nullomers that contain the base substitutions;
obtaining base substitution mutations with a frequency of more than 0.01 in the East Asian population; according to the positions of these base substitutions, finding among the nullomers the set D of nullomers that contain the base substitutions; removing the nullomers of set D from set C, and defining the remaining sequences as new short sequences;
counting the samples in which any new short sequence is read; for each new short sequence, counting the samples that contain it; and taking the ratio of the sample count of each new short sequence to the total number of samples in which any new short sequence is read.
The fifth marker is obtained by the following steps:
obtaining transcription factors from a GTRD database, and excluding the transcription factors which do not have known transcription sites in a CIS-BP database;
taking the range of -5 kb to +5 kb around each transcription site of the obtained transcription factors as a window, obtaining fragments of 100-220 bp in length that align to these windows, and sequentially applying GC correction and sequencing-depth smoothing to the read data in the windows to obtain a coverage pattern curve for each transcription factor;
for each transcription factor, the following three features were obtained, collectively as the nucleosome coverage pattern:
1) For all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) For the obtained coverage pattern curve, obtaining the amplitude at the trough of the curve as the center depth of the transcription factor;
3) Performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude at the highest point of the nucleosome amplitude signal.
The construction method of the malignant pulmonary nodule screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain reading data;
step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads within each window as a first feature set;
step 3, taking the m bases at the 5' end of each read as a base fragment set, and obtaining the proportion of each kind of base fragment among all the fragments as a second feature set;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a third feature set;
step 5, taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence is read as a fourth feature set;
step 6, using the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into the classification model as model feature vectors, and training the model with the benign or malignant status of the lung nodules as the output value to obtain the early-screening model.
Step 2 comprises the following steps:
step 2-1, dividing the reference genome into a plurality of windows, and obtaining the number of long reads and the number of short reads within each window;
step 2-2, standardizing the short read counts and long read counts of all windows from step 2-1, and taking the ratio of the standardized short read count to the standardized long read count as the first feature value.
In step 2-1, the window size is 5 Mb, and 541 windows are divided.
The short read is 100-150bp in length, and the long read is 151-220bp.
In the step 3, m is 4.
In the step 4, the window size is 1Mb, and 2475 windows are divided.
In step 5, the step of obtaining the fourth feature set is as follows:
step 5-1, generating, by exhaustive enumeration, the set A of all short sequences of 16 bp in length; enumerating the set B of all 16 bp short sequences present in the human reference genome sequence, and defining the sequences that remain after removing set B from set A as nullomers;
step 5-2, obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur more than once; according to the positions of these base substitutions, finding among the nullomers the set C of nullomers that contain the base substitutions;
step 5-3, obtaining base substitution mutations with a frequency of more than 0.01 in the East Asian population; according to the positions of these base substitutions, finding among the nullomers the set D of nullomers that contain the base substitutions; removing the nullomers of set D from set C, and defining the remaining sequences as new short sequences;
step 5-4, counting the samples in which any new short sequence is read; for each new short sequence, counting the samples that contain it; and taking the ratio of the sample count of each new short sequence to the total number of samples in which any new short sequence is read as the fourth feature set of the model.
The cancer database is a PCAWG database.
The different cancer species include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer.
Base substitution mutations in the east asian population were obtained from the gnomAD database.
The step 6 comprises the following steps:
step 6-1, obtaining transcription factors from a GTRD database, and excluding the transcription factors which do not have known transcription sites in a CIS-BP database;
step 6-2, taking the range of -5 kb to +5 kb around each transcription site of the transcription factors obtained in step 6-1 as a window, obtaining fragments of 100-220 bp in length that align to these windows, and sequentially applying GC correction and sequencing-depth smoothing to the read data in the windows to obtain a coverage pattern curve for each transcription factor;
step 6-3, for each transcription factor, the following three features were obtained, collectively as nucleosome coverage pattern features:
1) For all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) For the obtained coverage pattern curve, obtaining the amplitude at the trough of the curve as the center depth of the transcription factor;
3) Performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude at the highest point of the nucleosome amplitude signal.
In step 7, the classification model is trained by the following steps:
step 7-1, inputting the first, second, third, fourth and fifth feature sets into different classifier models respectively, training the models, and obtaining one or more optimal classifier models for each of the first, second, third, fourth and fifth feature sets;
step 7-2, performing second-level ensemble training on the optimal classifier models of the first, second, third, fourth and fifth feature sets obtained in step 7-1 to construct an ensemble classifier model.
The different classifier models are selected from elastic net regression (glm), extreme gradient boosting (xgboost), random forest (random forest), and deep learning neural network (deep learning, NN).
In the second-level ensemble training, a generalized linear model (GLM), an extreme gradient boosting (XGBoost) model or a deep learning regression model is used.
A malignant lung nodule detection apparatus comprising:
the sequencing module is used for extracting and sequencing cfDNA of the samples of the positive group and the control group to obtain read data and WGS sequencing data;
the first feature acquisition module is used for aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads within each window as a first feature set;
the second feature acquisition module is used for taking the m bases at the 5' end of each read as a base fragment set and obtaining the proportion of each kind of base fragment among all the fragments as a second feature set;
the third feature acquisition module is used for dividing the reference genome into a plurality of windows and obtaining copy number data of the WGS data in different windows on the chromosomes as a third feature set;
the fourth feature acquisition module is used for taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence is read as a fourth feature set;
the fifth feature acquisition module is used for analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
and the prediction module is used for taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into the classification model as model feature vectors, and training the model with the benign or malignant status of the lung nodules as the output value to obtain the early-screening model.
Drawings
FIG. 1 is a schematic diagram of a model building process;
FIG. 2 is a difference distribution plot of the most contributing features among the various features;
FIG. 3 is a graph of AUC curves for models using individual features and models using all of the features in the training set alone;
FIG. 4 is an AUC curve for a model using all features on the validation set;
FIG. 5 is a graph of the predicted scores of the ensemble classifier combining all models on the training set;
FIG. 6 is a graph of the predicted scores of the ensemble classifier combining all models on the validation set.
Detailed Description
The calculation method of the invention is detailed as follows:
The invention first requires extraction of cfDNA from blood samples, library construction, sequencing and related steps. The extraction and library construction methods are not particularly limited and can be adapted from extraction methods in the prior art. The base information of the cfDNA can be obtained using a prior-art sequencing technique.
The purpose of the model in this patent is to distinguish between benign lung nodules and malignant lung nodules. Samples are classified as follows: among lung nodule patients judged high-risk on LDCT imaging, those whose nodules are found to be benign by subsequent postoperative pathology form the control group, and those whose nodules are found to be malignant form the positive group.
The data set conditions adopted in the model construction process of the invention are as follows:
[Table image not reproduced: data set composition]
methods for extraction and sequencing of plasma cfDNA samples:
Before LDCT image diagnosis, a liquid biopsy is carried out on the patient. A purple-top blood collection tube (EDTA anticoagulant tube) is used to collect 10 ml of whole blood from the patient, plasma is separated by centrifugation promptly (within 2 hours), and the sample is transferred to the laboratory for analysis while stored frozen at -80 °C. After transport to the laboratory, cfDNA is extracted from the plasma samples using a QIAGEN plasma DNA extraction kit according to the instructions. A library is constructed from the collected cfDNA sample and WGS is performed at 5x depth. After the sequencing data are obtained, they are aligned to the human reference genome to obtain the base information of the corresponding reads.
The model establishing process of the patent mainly comprises the following steps:
step 2, extracting and sequencing cfDNA of the samples of the positive group and the control group to obtain read data;
step 3, aligning the read data to a reference genome, obtaining the number of reads in different length intervals within different windows of the reference genome, and taking the ratio of the numbers of reads of different lengths as the first feature value;
step 4, aligning the read data to a reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of m bp upstream and downstream of that position as a base fragment set; taking the proportion of each base fragment among all the fragments as the second feature value;
step 5, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as the third feature value;
step 6, taking the ratio of reads containing a 16 bp new short sequence to total reads as the fourth feature;
step 7, analyzing the nucleosome coverage pattern of the selected transcription factors as the fifth feature;
step 8, inputting the model feature vectors of the samples of the positive group and the control group into the first-layer models, and selecting, for each feature, the 5 models with the best AUC, so that 25 feature models in total are selected in the first layer;
step 9, inputting the 25 models selected in step 8 into the second-layer ensemble models, outputting the three ensemble models with the highest AUC, and taking the average of the predicted probabilities output by these three ensemble models as the final judgment result;
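The selection logic of the two layers can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: each candidate model is represented only by its feature type, a name and its predicted probabilities on labelled samples, and the helper names (`auc`, `top_k_per_feature`, `average_scores`) and the toy numbers are hypothetical. For brevity the sketch averages the selected first-layer models directly, whereas the patent trains second-layer ensemble models on them and averages the top three of those.

```python
from collections import defaultdict

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k_per_feature(candidates, labels, k):
    """candidates: list of (feature, model_name, scores).
    Keep the k models with the best AUC for each feature type."""
    by_feature = defaultdict(list)
    for feature, name, scores in candidates:
        by_feature[feature].append((auc(labels, scores), name, scores))
    selected = []
    for feature in sorted(by_feature):
        ranked = sorted(by_feature[feature], key=lambda t: t[0], reverse=True)
        selected.extend((feature, name, scores) for _, name, scores in ranked[:k])
    return selected

def average_scores(score_lists):
    """Element-wise mean of several models' predicted probabilities."""
    return [sum(col) / len(col) for col in zip(*score_lists)]

# Toy data: 1 = malignant, 0 = benign (hypothetical values).
labels = [1, 1, 0, 0]
candidates = [
    ("FSR", "glm_a", [0.9, 0.8, 0.2, 0.1]),
    ("FSR", "glm_b", [0.6, 0.4, 0.5, 0.3]),
    ("CNV", "xgb_a", [0.7, 0.3, 0.4, 0.2]),
]
selected = top_k_per_feature(candidates, labels, k=1)
final = average_scores([scores for _, _, scores in selected])
```

With k=5 and five feature types, `top_k_per_feature` would return the 25 first-layer models described in step 8.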
The five feature types in this patent are detailed as follows:
1. DNA Fragment Size Ratio (FSR)
The DNA fragment size ratio reflects the length distribution of cfDNA reads in benign and malignant tumors. Machine learning on the ratio of short to long DNA fragments is used to build a prediction model that distinguishes benign lung nodules from malignant lung nodules.
The cfDNA read length data are obtained by the following method: in the aligned BAM file, the quality, length and alignment position of each read are recorded; the human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb, the number of short reads (100 bp-150 bp) and the number of long reads (151 bp-220 bp) in each window are counted, and each count is standardized against the counts of that read type across all windows: standardized value = (original value - mean) / standard deviation. This yields 541 sets of standardized counts of reads of different lengths.
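The counting and standardization described above can be sketched as follows. This is a minimal illustration that assumes fragments have already been extracted from the alignment as `(chromosome, start, length)` tuples; the function names and toy fragments are hypothetical. The sketch stops at the standardized counts, because the text does not specify how the final ratio handles near-zero standardized denominators.

```python
from statistics import mean, pstdev

SHORT = (100, 150)   # short-read length range in bp, inclusive
LONG = (151, 220)    # long-read length range in bp, inclusive
WINDOW = 5_000_000   # 5 Mb windows

def count_by_window(fragments):
    """fragments: iterable of (chrom, start, length).
    Returns {(chrom, window_index): [n_short, n_long]}."""
    counts = {}
    for chrom, start, length in fragments:
        pair = counts.setdefault((chrom, start // WINDOW), [0, 0])
        if SHORT[0] <= length <= SHORT[1]:
            pair[0] += 1
        elif LONG[0] <= length <= LONG[1]:
            pair[1] += 1
    return counts

def zscores(values):
    """(value - mean) / standard deviation across all windows."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd else 0.0 for v in values]

# Toy fragments (hypothetical): two in the first window, three in the second.
fragments = [
    ("chr1", 100, 120), ("chr1", 200, 180),
    ("chr1", 5_000_100, 140), ("chr1", 5_000_200, 145), ("chr1", 5_000_300, 200),
]
counts = count_by_window(fragments)
short_z = zscores([c[0] for c in counts.values()])
long_z = zscores([c[1] for c in counts.values()])
```

The population standard deviation (`pstdev`) is an assumption; the text does not say whether the sample or population form is used.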
2. Read Ratio of 5'-End Breakpoint Sequences (Breakpoint Motifs, BPM)
Human genomic DNA has a double-helix structure in which the two strands are linked by hydrogen bonds through complementary base pairing. During normal aging and cancer progression, the pH of the environment around the cells changes, so that the hydrogen bonds between complementary bases are destroyed and breakage occurs. Because the base sequences at the breakpoints differ, the proportions of reads carrying the different breakpoint sequences also differ. The collection method is as follows: in the aligned BAM file, the base information and aligned position of each read are recorded; the 4 bp on each side of the breakpoint, i.e. the human reference genome coordinate at which the 5' end of each read is located, are determined; the number of reads is counted for each 8 bp breakpoint sequence (4^8 = 65536 in total); and 65536 breakpoint sequence read ratios are calculated, for example, AAAAAAAA read ratio = number of AAAAAAAA reads / total number of reads over all breakpoint sequences.
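The read-ratio calculation can be sketched as follows, assuming the 8 bp sequence around each read's 5' breakpoint has already been extracted from the reference genome. The function name and the toy input are hypothetical; motifs containing bases other than A, C, G and T are skipped, which is an assumption not stated in the text.

```python
from collections import Counter
from itertools import product

K = 8  # 4 bp on each side of the 5' breakpoint

def breakpoint_motif_ratios(motifs):
    """motifs: iterable of 8-base strings, one per read.
    Returns {motif: proportion} over all 4**8 possible motifs."""
    counts = Counter(m for m in motifs if len(m) == K and set(m) <= set("ACGT"))
    total = sum(counts.values())
    all_motifs = ("".join(p) for p in product("ACGT", repeat=K))
    return {m: (counts[m] / total if total else 0.0) for m in all_motifs}

# Toy input (hypothetical): the motif with N is ignored.
ratios = breakpoint_motif_ratios(["AAAAAAAA", "AAAAAAAA", "ACGTACGT", "NNNNNNNN"])
```

The returned dictionary always has 65536 entries, so every sample yields a feature vector of the same length even when most motifs are unobserved.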
3. 1 Mb Window Copy Number Variation (1 Mb-Bin Copy Number Variation, CNV)
Copy number changes are highly correlated with individual cancers. Although some cancer-associated genes or specific genomic intervals can already be distinguished by detecting copy number changes, other rare or unknown genes or intervals may also carry information on potential copy number changes. The collection method is as follows: first, WGS data of 30 healthy people are collected; chromosomes 1-22 of the reference genome are divided into non-overlapping windows of 1 Mb; for each sample, the read depth in each window is calculated using bedtools coverage and corrected according to the GC content and the average mappability record (UCSC BigWig file) of each window; and the median depth of the 30 healthy people in each window is taken as representative, giving a population baseline of read depth for 2475 windows. For each sample to be tested, read depth information for the 2475 windows is obtained in the same way, and a Hidden Markov Model (HMM) is used with the population baseline depth of each window to construct the log copy number change of each window, namely log2(corrected and normalized depth of the sample to be tested / corrected and normalized depth of the population baseline), thereby obtaining the copy number change information of each sample to be tested.
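The baseline and log2-ratio arithmetic can be sketched as follows. This is a simplified illustration under stated assumptions: the per-window depths are taken to be already GC- and mappability-corrected and strictly positive, normalization is done by dividing by the sample median (a stand-in for the unspecified homogenization step), and the HMM step is omitted. All function names and numbers are hypothetical.

```python
from math import log2
from statistics import median

def baseline_depth(healthy_depths):
    """healthy_depths: one list of corrected window depths per healthy sample.
    Returns the per-window median across samples."""
    return [median(window) for window in zip(*healthy_depths)]

def normalize(depths):
    """Scale depths so that the median window depth is 1."""
    m = median(depths)
    return [d / m for d in depths]

def log2_ratios(sample_depths, baseline):
    """log2(normalized sample depth / normalized baseline depth) per window."""
    s, b = normalize(sample_depths), normalize(baseline)
    return [log2(x / y) for x, y in zip(s, b)]

# Toy data (hypothetical): three healthy samples, three windows.
healthy = [[10, 10, 10], [12, 12, 12], [8, 8, 8]]
baseline = baseline_depth(healthy)
ratios = log2_ratios([10, 20, 10], baseline)   # middle window is doubled
```

A log2 ratio of 1 in a window corresponds to twice the baseline depth, and 0 to no change.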
4. 16 bp Tumor New Short Sequences (16 bp Neomers, NEO)
Nullomers are short DNA sequences that are not present in the human genome. The 16 bp tumor-associated new short sequences (neomers) are a subset of nullomers: specifically, nullomers of 16 bp in length that are absent from the human genome but are recurrently found in the genomes of tumor tissue.
The characteristic value is obtained in the following way:
First, exhaustive enumeration is used to generate the set A of all possible short sequences of 16 bp in length. In the human reference genome sequence (hg19 version), a sliding window with a step of 1 bp is used to enumerate the set B of all 16 bp short sequences and their occurrence counts, and the 16 bp short sequences in set A that do not appear in set B are defined as nullomers.
By analyzing the PCAWG database (https://dcc.icgc.org/PCAWG), this patent obtains the WGS mutation results of 2577 patients with 6 different types of cancer (intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer, liver cancer) and extracts from them 977 recurrent base substitutions (occurring at least twice). According to the positions of these base substitutions, the set C of all possible nullomer short sequences containing the base substitutions is extracted by exhaustive enumeration.
From the gnomAD database (https://gnomad.broadinstitute.org/), base-substitution mutation sites with a frequency of more than 0.01 in the East Asian population are collected, and according to the positions of these mutation sites, the set D of short sequences containing them is found among the nullomers and used as the set of nullomers common in the East Asian population. The nullomers of set D are removed from set C to obtain 4616 new 16 bp tumor-related short sequences. For these 4616 new short sequences, the number of samples in which any of the 4616 new short sequences is read is counted first; then, for each new short sequence, the number of samples containing it is counted, and the ratio of the sample count of each new short sequence to the total number of samples in which any new short sequence is read (4616 ratio values in total) is used as the fourth feature of the model.
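The set construction (A minus B, then C minus D) and the per-sample ratio can be sketched at toy scale as follows. Enumerating all 4^16 sequences in memory is impractical in plain Python, so the sketch uses k = 3 on a made-up reference string; the function names, the reference, the substitutions and the sample k-mer sets are all hypothetical.

```python
from itertools import product

def kmers(seq, k):
    """All k-mers of seq, using a sliding window with a step of 1."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def nullomers(reference, k):
    """Set A (all possible k-mers) minus set B (k-mers in the reference)."""
    all_kmers = {"".join(p) for p in product("ACGT", repeat=k)}
    return all_kmers - kmers(reference, k)

def mutant_kmers(reference, pos, alt, k):
    """All k-mers that overlap position pos after substituting base alt."""
    mutated = reference[:pos] + alt + reference[pos + 1:]
    lo, hi = max(0, pos - k + 1), min(len(mutated), pos + k)
    return kmers(mutated[lo:hi], k)

def neomers(reference, tumor_subs, common_subs, k):
    """Nullomers created by tumor substitutions (C) minus those created by
    common population substitutions (D)."""
    null = nullomers(reference, k)
    c = set().union(*(mutant_kmers(reference, p, a, k) for p, a in tumor_subs)) & null
    d = set().union(*(mutant_kmers(reference, p, a, k) for p, a in common_subs)) & null
    return c - d

def sample_proportions(sample_kmer_sets, neo):
    """For each neomer: samples containing it / samples containing any neomer."""
    hits = [s & neo for s in sample_kmer_sets]
    n_any = sum(1 for h in hits if h)
    return {n: (sum(1 for h in hits if n in h) / n_any if n_any else 0.0)
            for n in sorted(neo)}

reference = "ACGTACGT"
neo = neomers(reference, tumor_subs=[(3, "A"), (4, "C")], common_subs=[(4, "C")], k=3)
props = sample_proportions([{"CGA", "ACG"}, {"GAA", "CGA"}, {"ACG"}], neo)
```

For real 16 bp data, a disk-backed or bit-packed k-mer counter would be needed in place of the in-memory set of all k-mers.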
5. Nucleosome coverage Pattern (Nucleosome coverage patterns, NCP)
The transcription factors are selected from GTRD database (https:// rd. Bieuml. Org/# |) (v 21.12), the transcription factors which do not have known transcription sites in CIS-BP database (http:// cisbp. Ccbr. Utono. ca /) (v 2.00) are excluded, and 334 transcription factors with more than 10000 high-matching sites are selected.
For the transcription factors obtained above, using the range of-5 kb to +5kb near the transcription site in these target transcription factors as windows, fragments of 100-220bp in length that can be aligned into these windows were obtained. For these fragments, GC correction was performed and the final coverage pattern curve for each transcription factor was obtained using a Savitzky-Golay filter flattening curve with a polynomial power of 3 for the sequencing depth.
After the coverage pattern curve described above is obtained, three features are extracted for each transcription factor:
1) the average depth from 1 kb upstream to 1 kb downstream of the transcription sites, calculated over all transcription sites of the transcription factor;
2) the center depth of the transcription factor;
3) the amplitude at the highest point of the nucleosome amplitude signal, obtained by applying a fast Fourier transform to the coverage pattern curve.
These three features together are taken as the feature values of the nucleosome coverage pattern.
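A minimal sketch of the three features, under stated assumptions: the curve is binned (the bin width `bin_bp` is hypothetical, since the text does not give one), the center depth is read at the middle bin, and a plain discrete Fourier transform is used in place of a fast implementation for readability.

```python
import cmath
import math


def ncp_features(curve, bin_bp=10):
    """Return (mean depth within +/-1 kb, center depth, peak Fourier amplitude).

    curve: smoothed coverage over the -5 kb..+5 kb window, one value per bin.
    """
    n = len(curve)
    center = n // 2
    flank = 1000 // bin_bp
    window = curve[center - flank:center + flank + 1]
    mean_depth = sum(window) / len(window)
    central_depth = curve[center]
    # Amplitude of every non-zero frequency; the largest one is kept.
    amplitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(curve))
        amplitudes.append(abs(coeff) / n)
    return mean_depth, central_depth, max(amplitudes)


# Synthetic curve: 100 bins of 100 bp carrying a periodic signal of 5 cycles.
curve = [1.0 + 0.5 * math.cos(2 * math.pi * 5 * t / 100) for t in range(100)]
mean_depth, central_depth, peak_amplitude = ncp_features(curve, bin_bp=100)
print(mean_depth, central_depth, peak_amplitude)
```

With 334 transcription factors and three features each, this yields the 1002 nucleosome coverage pattern values used later as model input.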
Through the data acquisition above, initial data vectors for the five types of data are obtained. A corresponding calculation method is then designed: in this patent, a conventional classifier algorithm can be used, with the feature values as the classifier input and the probability of a malignant lung nodule as the output. Four types of classifier model are used in the optimization in this patent; during optimization, submodels with different model parameters are generated under each model type so that the submodels can be screened. The four main models are:
1. Elastic network regression model (GLM)
The elastic network regression model is a common machine learning algorithm. It fits a generalized linear model by penalized maximum likelihood and combines the L2 regularization of ridge regression with the L1 regularization of LASSO regression. The regularization path is computed for the lasso or elastic network penalty over a grid of values of the regularization parameter λ, which addresses over-fitting in the regression. The hyperparameter alpha controls the balance between the L1 and L2 regularization.
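The role of λ and alpha can be made concrete with the penalty term itself. The sketch below assumes the common glmnet-style weighting, λ · (alpha · ‖β‖₁ + (1 − alpha)/2 · ‖β‖₂²); the text does not state which exact convention the toolkit used in the patent follows, and the function name is hypothetical.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Elastic-net penalty: alpha = 1 gives the lasso, alpha = 0 gives ridge."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


coefs = [1.0, -2.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(coefs, lam=0.1, alpha=alpha))
```

The penalty is added to the negative log-likelihood, so larger λ shrinks the coefficients more strongly and alpha decides how much of that shrinkage drives coefficients exactly to zero.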
2. Extreme gradient boosting (XGBoost)
XGBoost is an optimized additive ensemble algorithm based on the gradient boosting decision tree (GBDT). It expands the loss function with a second-order Taylor formula to optimize it and improve calculation accuracy, simplifies the model with a regularization term to avoid overfitting, and supports parallel computation through its block storage structure.
The XGBoost base learner used in this patent is a tree model. As the depth of a tree increases, its complexity increases; the model can fit the training data better, but overfitting may also result. The hyperparameter max_depth controls the maximum depth of the tree, and the hyperparameter min_rows controls the minimum number of samples in each leaf node.
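The second-order Taylor expansion mentioned above leads to a closed-form leaf weight, -G / (H + λ), where G and H are the sums of the first and second derivatives of the loss over the samples in a leaf. The sketch below works this out for the log loss; it illustrates the formula only and does not call the XGBoost library, and the function names and the default `reg_lambda` are assumptions.

```python
import math


def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the log loss with respect to the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)


def optimal_leaf_weight(labels, raw_scores, reg_lambda=1.0):
    """Leaf weight minimizing the regularized second-order approximation."""
    grads, hessians = zip(*(logistic_grad_hess(y, s) for y, s in zip(labels, raw_scores)))
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Three samples in one leaf (two malignant, one benign), all starting at raw score 0.
weight = optimal_leaf_weight([1, 1, 0], [0.0, 0.0, 0.0])
print(weight)
```

The regularization term λ in the denominator pulls the leaf weight toward zero, which is one way the regular term limits overfitting.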
3. Random forest
Random forests are a powerful classification and regression tool for high-dimensional and multicollinear data. Given a data set, a random forest randomly draws part of the information to generate a group of decision trees for classification or regression: a node-splitting attribute is chosen, and the random draw is repeated until no further split is possible. Finally, the results of all trees are combined to obtain the final prediction. The random forest also controls the complexity of the trees through hyperparameters such as max_depth and min_rows.
4. Deep learning neural network model (Deep Learning, NN)
A neural network consists of inputs, weights, biases or thresholds, and outputs. If the output of any single node is above a specified threshold, that node is activated and its data is sent to the next layer of the network. Each node of the hidden layer is computed from the nodes of the input layer by weighted summation followed by an activation function. The values of the hidden layer are passed to the output layer by the same method. The method offers high classification accuracy, strong parallel distributed processing capability, and strong distributed storage and learning capabilities.
The deep learning neural network model used in this patent is a multi-layer feedforward neural network. A feedforward network consists of an input layer at the front, hidden layers in the middle (there may be several), and an output layer at the end. Network signals propagate in one direction from the input layer to the output layer, and each layer of neurons is connected only to the neurons of the previous layer, from which it receives information. The network is trained with a stochastic gradient descent algorithm, which estimates the error gradient of the current model state from data in the training set and then updates the model weights using that error. The learning rate (epsilon), which sets how much the weights are updated during training and thus how quickly the model adapts to the problem, and the learning rate decay index (rho), which also affects the rate of model learning, are both important configurable parameters in neural network training.
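The one-directional propagation described above can be sketched as a forward pass. The layer sizes, weights and activation functions here (ReLU in the hidden layer, sigmoid at the output) are hypothetical choices for illustration; the text does not specify them.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(features, layers):
    """Propagate features through (weights, biases, activation) layers in order."""
    signal = features
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal


layers = [
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], relu),  # hidden layer, 2 neurons
    ([[1.0, 1.0]], [-3.0], sigmoid),                # output layer, 1 neuron
]
probability = forward([1.0, 2.0], layers)
print(probability)
```

Training would adjust the weights and biases by gradient descent; only the inference direction is shown here.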
In addition, the patent also uses random grid search (Random Grid Search Parameters) to optimize the models.
Random search is a common method of hyperparameter optimization in machine learning. It randomly draws parameter values within a specified range for each model parameter and selects the best parameter combination among the sampled ones. Rather than trying all possible combinations, it evaluates a fixed number of random combinations, each made of one random value per hyperparameter. Compared with manual tuning and grid search, random search can achieve good results with fewer trials and is a more efficient solution, especially when the number of parameters is large.
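A minimal sketch of random search follows. The hyperparameter names max_depth and min_rows come from the text; the candidate values and the toy scoring function are hypothetical stand-ins for training a submodel and measuring its AUC.

```python
import random


def random_search(param_space, evaluate, n_trials, seed=0):
    """Sample n_trials random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


space = {"max_depth": [3, 5, 7, 9], "min_rows": [1, 5, 10]}


def toy_score(params):
    """Stand-in for a model's AUC; peaks at max_depth=5, min_rows=5."""
    return -(abs(params["max_depth"] - 5) + abs(params["min_rows"] - 5))


best_params, best_score = random_search(space, toy_score, n_trials=20, seed=1)
print(best_params, best_score)
```

Fixing the seed makes a search reproducible, which helps when the same submodels need to be regenerated later.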
In implementing the four algorithms in this patent, the following … (algorithm type, or the algorithm toolkit that is called) is specifically adopted.
The hyper-parameters of the four algorithms used in this patent are shown in the following table:
Figure 971013DEST_PATH_IMAGE002
After the five types of initial data were acquired for 247 patients with malignant lung nodules and 60 patients with benign lung nodules, the fragment size ratio (FSR) statistics were used as input values (the input vector of each sample comprises 541 read fragment size ratios), and the malignant and benign lung nodule samples were classified with each of the four classification models. During screening, the hyperparameters of the four model types were randomly searched to vary the parameters and structures, each resulting configuration was trained as a submodel, and the five best submodels for this feature were selected, using the AUC on the training set as the measure of classification performance. Similarly, after the proportions of 5'-end breakpoint sequences of DNA fragments were collected for the benign and malignant pulmonary nodule patients, the proportions of the 5'-end breakpoint sequences (65536 kinds) were used as input values to classify the malignant and benign lung nodule samples with the four classification models, and the five best submodels for this feature were selected (model optimization and hyperparameter tuning as above). Likewise, copy number variation (2475 values), new short sequences (4616 values) and nucleosome coverage patterns (1002 values) were each used as input values and classified with the four model types, and the five best submodels were selected for each feature (model optimization as above). Through this process, a total of 5 × 5 = 25 model results were obtained.
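The selection of the best submodels by AUC can be sketched as follows. The feature dimensions are taken from the text; the AUC is computed by pairwise comparison, and the submodel names and scores are hypothetical toy data.

```python
# Input dimensions of the five feature types, as stated in the text.
FEATURE_DIMS = {"FSR": 541, "BPM": 65536, "CNV": 2475, "NEO": 4616, "NCP": 1002}


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_submodels(submodel_scores, labels, k=5):
    """Rank the submodels of one feature type by AUC and keep the best k."""
    ranked = sorted(submodel_scores,
                    key=lambda name: auc(labels, submodel_scores[name]),
                    reverse=True)
    return ranked[:k]


labels = [1, 1, 0, 0]  # 1 = malignant, 0 = benign
scores = {
    "submodel_a": [0.9, 0.3, 0.4, 0.1],
    "submodel_b": [0.9, 0.8, 0.2, 0.1],
    "submodel_c": [0.1, 0.2, 0.8, 0.9],
}
best_two = top_submodels(scores, labels, k=2)
print(best_two, len(FEATURE_DIMS) * 5)
```

Repeating the selection with k=5 for each of the five feature types gives the 25 base models that enter the later stacking step.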
In each calculation, the contribution of each feature vector to the classification result can be obtained.
The five best models selected for each feature (25 models in total) are shown in the following table:
Figure 203412DEST_PATH_IMAGE003
Figure 709479DEST_PATH_IMAGE004
The contribution values of the optimal model selected for each feature are ranked as follows:
1. DNA fragment size ratio (FSR), deep learning neural network model (Deep Learning, NN):
Figure 378227DEST_PATH_IMAGE005
Figure 995153DEST_PATH_IMAGE006
2. 5'-end breakpoint sequence (BPM), elastic network regression model (GLM):
Figure 82058DEST_PATH_IMAGE007
Figure 759027DEST_PATH_IMAGE008
3. Copy number variation (CNV), deep learning neural network model (Deep Learning, NN):
Figure 931382DEST_PATH_IMAGE009
Figure 86420DEST_PATH_IMAGE010
4. 16 bp tumor new short sequence (NEO), XGBoost model:
Figure 293410DEST_PATH_IMAGE011
Figure 390548DEST_PATH_IMAGE012
5. Nucleosome coverage pattern (NCP), XGBoost model:
Figure 50200DEST_PATH_IMAGE013
Figure 743349DEST_PATH_IMAGE014
To further improve the predictive performance of the classifier, secondary ensemble training (stacking) is performed on the results of the 25 trained models. Stacking is an ensemble learning technique: the 25 low-level classifiers (1st-level base models) are trained on again by meta-learning (2nd-level meta-learning), which gathers the characteristics of each base classifier and finds an optimal way of integrating them, thereby improving predictive performance. The training algorithms used for stacking are a generalized linear model (GLM), an extreme gradient boosting (XGBoost) model and a deep learning regression model.
The 3 optimal integrated models and the importance of each feature model (variable importance) are shown in the following table:
Figure 539267DEST_PATH_IMAGE015
Figure 558038DEST_PATH_IMAGE016
The optimal stacking model, with the highest AUC, is a generalized linear model (GLM). It establishes the relation between the mathematical expectation of the response variable and a linear combination of the predictor variables through a link function, and converts the 25 trained models into a final linear equation:
ALL Stacked = Intercept + A × CNV model 1 + B × CNV model 2 + C × CNV model 3 + D × CNV model 4 + E × CNV model 5 + F × BPM model 1 + G × BPM model 2 + H × BPM model 3 + I × BPM model 4 + J × BPM model 5 + K × FSR model 1 + L × FSR model 2 + M × FSR model 3 + N × FSR model 4 + O × FSR model 5 + P × NEO model 1 + Q × NEO model 2 + R × NEO model 3 + S × NEO model 4 + T × NEO model 5 + U × NCP model 1 + V × NCP model 2 + W × NCP model 3 + X × NCP model 4 + Y × NCP model 5,
where Intercept and A to Y are the parameters of the linear equation.
The specific coefficients are as follows:
Figure 704986DEST_PATH_IMAGE017
Figure 467406DEST_PATH_IMAGE018
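The linear equation of the stacked model can be evaluated as sketched below. The coefficients and base-model outputs are hypothetical placeholders (the actual coefficients appear only in the tables above), and the logistic link is an assumption, since the text says only that a link function is used.

```python
import math


def stacked_probability(base_probs, coefs, intercept):
    """Combine base-model outputs linearly, then apply a logistic link."""
    eta = intercept + sum(coefs[name] * p for name, p in base_probs.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical outputs of two of the 25 base models for one sample.
base_probs = {"CNV model 1": 0.5, "BPM model 1": 1.0}
coefs = {"CNV model 1": 2.0, "BPM model 1": -1.0}
print(stacked_probability(base_probs, coefs, intercept=0.0))
print(stacked_probability(base_probs, coefs, intercept=1.0))
```

In the full model the dictionaries would hold 25 entries, one per base model, with the coefficients A to Y.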
Each feature has a certain predictive effect under the different training algorithms, and the secondary ensemble training improves on the predictive performance of the individual features. Finally, the AUC of the prediction for the training set reaches 0.9474, the AUC for the validation set reaches 0.931, the sensitivity is 85%, and the specificity is 98.7%.
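For reference, sensitivity and specificity are computed from predicted probabilities as sketched here. The probability cutoff of 0.5 and the toy data are assumptions; the text does not state the threshold behind the reported 85% and 98.7%.

```python
def sensitivity_specificity(labels, probs, threshold=0.5):
    """Return (sensitivity, specificity) for binary labels (1 = malignant)."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


labels = [1, 1, 1, 0, 0]
probs = [0.9, 0.6, 0.4, 0.2, 0.7]
sens, spec = sensitivity_specificity(labels, probs)
print(sens, spec)
```

Raising the threshold trades sensitivity for specificity, so both figures should be read together with the cutoff that produced them.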

Claims (10)

1. Use of a gene marker in the preparation of a malignant pulmonary nodule screening reagent, characterized in that the gene marker comprises:
a first marker: the number of short reads and the number of long reads among cfDNA fragments aligned to different windows of the reference genome;
a second marker: the proportion, among all base fragments, of each kind of m-base fragment at the 5' end of the cfDNA fragments aligned to the reference genome;
a third marker: the copy number in different windows on the chromosomes in WGS data;
a fourth marker: the proportion of tumor new short sequences;
a fifth marker: the nucleosome coverage pattern.
2. The use of claim 1, wherein the first marker is obtained by the following steps: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads;
the second marker is obtained by the following steps: taking the m bases at the 5' end of the reads as a base fragment set, and obtaining the proportion of each kind of base fragment among all fragments;
the third marker is obtained by the following steps: dividing a reference genome into a plurality of windows, and obtaining the copy number data in the different windows on chromosomes 1-22 in the WGS data;
the fourth marker is obtained by the following steps:
generating, by exhaustive enumeration, a set A of all short sequences 16 bp in length; enumerating the set B of all 16 bp short sequences present in the human reference genome sequence, removing the sequences of set B from set A, and defining the remaining sequences as invalid seeds;
obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur multiple times; according to the positions of the base substitutions, finding among the invalid seeds the set C of invalid seeds containing the base substitutions;
obtaining the base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid seeds the set D of invalid seeds containing the base substitutions; removing the invalid seeds of set D from set C, and defining the remaining sequences as new short sequences;
counting the number of samples in which any new short sequence can be read, determining for each new short sequence the number of samples containing it, and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read;
the fifth marker is obtained by the following steps:
obtaining transcription factors from the GTRD database, and excluding the transcription factors that have no known transcription sites in the CIS-BP database;
taking the range from -5 kb to +5 kb around each transcription site of the obtained transcription factors as a window, obtaining the fragments 100-220 bp in length that align within these windows, and sequentially performing GC correction and smoothing of the sequencing depth on the read data in the windows to obtain a coverage pattern curve for each transcription factor;
for each transcription factor, obtaining the following three features, which together constitute the nucleosome coverage pattern:
1) for all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) for the obtained coverage pattern curve, obtaining the amplitude value of the curve trough as the center depth of the transcription factor;
3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest point of the nucleosome amplitude signal.
3. A method for constructing a malignant lung nodule screening model, characterized by comprising the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;
step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads as a first feature set;
step 3, taking the m bases at the 5' end of the reads as a base fragment set, and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data of the WGS data in the different windows on the chromosomes as a third feature set;
step 5, taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
step 6, analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into the classification model as model feature vectors, and training the model with benign or malignant lung nodule status as the output value to obtain the early-screening model.
4. The method for constructing a malignant lung nodule screening model according to claim 3, wherein the step 3 comprises: step 3-1, dividing the reference genome into a plurality of windows, and obtaining the number of long reads and the number of short reads within each window; step 3-2, standardizing the numbers of short reads and long reads of all windows from step 3-1, and taking the ratio of the standardized short read number to the standardized long read number as a first feature value.
5. The method as claimed in claim 3, wherein the window size in step 3-1 is 5Mb, and 541 windows are defined.
6. The method as claimed in claim 3, wherein the short reads are 100-150bp long, and the long reads are 151-220bp long;
in the step 3, m is 4;
in the step 4, the window size is 1Mb, and 2475 windows are divided.
7. The method for constructing a malignant lung nodule screening model according to claim 3, wherein the fourth feature set is obtained by the following steps:
step 5-1, generating, by exhaustive enumeration, a set A of all short sequences 16 bp in length; enumerating the set B of all 16 bp short sequences present in the human reference genome sequence, removing the sequences of set B from set A, and defining the remaining sequences as invalid seeds;
step 5-2, obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur multiple times; according to the positions of the base substitutions, finding among the invalid seeds the set C of invalid seeds containing the base substitutions;
step 5-3, obtaining the base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid seeds the set D of invalid seeds containing the base substitutions; removing the invalid seeds of set D from set C, and defining the remaining sequences as new short sequences;
step 5-4, counting the number of samples in which any new short sequence can be read, determining for each new short sequence the number of samples containing it, and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read as the fourth feature set of the model;
the cancer database is a PCAWG database;
the different cancer species include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer;
base substitution mutations in the east asian population were obtained from the gnomAD database.
8. The method of constructing a malignant lung nodule screening model according to claim 3, wherein the step 6 comprises:
step 6-1, obtaining transcription factors from a GTRD database, and excluding the transcription factors which do not have known transcription sites in a CIS-BP database;
step 6-2, taking the range from -5 kb to +5 kb around each transcription site of the transcription factors obtained in step 6-1 as a window, obtaining the fragments 100-220 bp in length that align within these windows, and sequentially performing GC correction and smoothing of the sequencing depth on the read data in the windows to obtain a coverage pattern curve for each transcription factor;
step 6-3, for each transcription factor, obtaining the following three features, which together constitute the nucleosome coverage pattern features:
1) for all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) for the obtained coverage pattern curve, obtaining the amplitude value of the curve trough as the center depth of the transcription factor;
3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest point of the nucleosome amplitude signal.
9. The method of claim 3, wherein in step 7, the step of classifying the model comprises:
step 7-1, inputting the first, second, third, fourth and fifth feature sets into different classifier models respectively, training the models, and obtaining one or more optimal classifier models respectively aiming at the first, second, third, fourth and fifth feature sets;
step 7-2, performing secondary set training on the optimal classifier models of the first, second, third, fourth and fifth feature sets obtained in the step 7-1 to construct an integrated classifier model;
the different classifier models are selected from an elastic network regression model, an extreme gradient boosting model, a random forest model and a deep learning neural network model; a generalized linear model, an extreme gradient boosting (XGBoost) model or a deep learning regression model is used in the secondary ensemble training.
10. A malignant lung nodule detection apparatus comprising:
a sequencing module, used for extracting and sequencing cfDNA from samples of the positive group and the control group to obtain read data and WGS sequencing data;
a first feature acquisition module, used for aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads as a first feature set;
a second feature acquisition module, used for taking the m bases at the 5' end of the reads as a base fragment set and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
a third feature acquisition module, used for dividing the reference genome into a plurality of windows and obtaining the copy number data of the WGS data in the different windows on the chromosomes as a third feature set;
a fourth feature acquisition module, used for taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
a fifth feature acquisition module, used for analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
and a prediction module, used for taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into the classification model as model feature vectors, and training the model with benign or malignant lung nodule status as the output value to obtain the early-screening model.
CN202211220583.XA 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device Active CN115295074B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211220583.XA CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
CN202211453553.3A CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211220583.XA CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202211453553.3A Division CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Publications (2)

Publication Number Publication Date
CN115295074A true CN115295074A (en) 2022-11-04
CN115295074B CN115295074B (en) 2022-12-16

Family

ID=83834944

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202211453553.3A Pending CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device
CN202211220583.XA Active CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202211453553.3A Pending CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Country Status (1)

Country Link
CN (2) CN116052768A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105518151A (en) * 2013-03-15 2016-04-20 莱兰斯坦福初级大学评议会 Identification and use of circulating nucleic acid tumor markers
CN108753790A (en) * 2018-06-12 2018-11-06 北京市神经外科研究所 With the relevant gene markers of BAVM and its mutation
CN109689891A (en) * 2016-07-06 2019-04-26 夸登特健康公司 The method of segment group spectrum analysis for cell-free nucleic acid
CN110106244A (en) * 2019-06-06 2019-08-09 广州市雄基生物信息技术有限公司 A kind of noninvasive molecule parting kit of breast cancer and method
CN113362897A (en) * 2020-03-06 2021-09-07 福建和瑞基因科技有限公司 Tumor marker screening method based on nucleosome distribution characteristics and application
CN113903398A (en) * 2021-09-08 2022-01-07 南京世和基因生物技术股份有限公司 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium
WO2022040163A1 (en) * 2020-08-18 2022-02-24 Delfi Diagnostics, Inc. Methods and systems for cell-free dna fragment size densities to assess cancer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ILIAS GEORGAKOPOULOS-SOARES等: "Leveraging sequences missing from the human genome to diagnose cancer", 《MEDRXIV》 *
PETER ULZ等: "Inference of transcription factor binding from cell-free DNA enables tumor subtype prediction and early detection", 《NATURE COMMUNICATIONS》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115678999A (en) * 2022-12-30 2023-02-03 南京世和基因生物技术股份有限公司 Application of gene marker in non-small cell lung cancer recurrence prediction, detection method of minimal residual lesion and prediction device
CN115984629A (en) * 2023-02-14 2023-04-18 成都泰莱生物科技有限公司 Lung nodule classification method and product based on lung CT and 5mC marker fusion
CN115984629B (en) * 2023-02-14 2024-02-02 成都泰莱生物科技有限公司 Lung nodule classification method and product based on fusion of lung CT and 5mC marker
CN116386718A (en) * 2023-05-30 2023-07-04 北京华宇亿康生物工程技术有限公司 Method, apparatus and medium for detecting copy number variation
CN116386718B (en) * 2023-05-30 2023-08-01 北京华宇亿康生物工程技术有限公司 Method, apparatus and medium for detecting copy number variation
CN117352064A (en) * 2023-12-05 2024-01-05 成都泰莱生物科技有限公司 Lung cancer metabolic marker combination and screening method and application thereof
CN117352064B (en) * 2023-12-05 2024-02-09 成都泰莱生物科技有限公司 Lung cancer metabolic marker combination and screening method and application thereof

Also Published As

Publication number Publication date
CN116052768A (en) 2023-05-02
CN115295074B (en) 2022-12-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant