CN115295074B - Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device - Google Patents

Info

Publication number
CN115295074B
Authority
CN
China
Prior art keywords
model
samples
obtaining
invalid
windows
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211220583.XA
Other languages
Chinese (zh)
Other versions
CN115295074A (en
Inventor
邵阳
吴雪
包华
刘睿
吴舒雨
吴旻
杨珊珊
刘思思
郑丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Original Assignee
Nanjing Shihe Medical Devices Co ltd
Nanjing Shihe Gene Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Shihe Medical Devices Co ltd, Nanjing Shihe Gene Biotechnology Co ltd filed Critical Nanjing Shihe Medical Devices Co ltd
Priority to CN202211453553.3A priority Critical patent/CN116052768A/en
Priority to CN202211220583.XA priority patent/CN115295074B/en
Publication of CN115295074A publication Critical patent/CN115295074A/en
Application granted granted Critical
Publication of CN115295074B publication Critical patent/CN115295074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Molecular Biology (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention provides the application of gene markers in malignant pulmonary nodule screening, a method for constructing a screening model, and a detection device. From high-throughput sequencing results, five classes of features that differ between benign and malignant pulmonary nodules appearing high-risk on imaging are analyzed: the DNA fragment size ratio, the proportion of sequence reads by 5'-end breakpoint motif, 1 Mb window copy number variation, 16 bp tumor new short sequences (neomers), and nucleosome coverage patterns. Automated machine learning is used to construct a multi-feature, multi-algorithm ensemble model that predicts whether a pulmonary nodule appearing high-risk on imaging is benign or malignant, thereby achieving accurate noninvasive diagnosis of malignant pulmonary nodules and reducing unnecessary resections of benign pulmonary nodules.

Description

Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
Technical Field
The invention relates to early screening that distinguishes benign from malignant lung nodules appearing high-risk on imaging (radiographically high-risk lung nodules), and belongs to the technical field of molecular biomedicine.
Background
Lung cancer is one of the most common cancers in the world. High-risk groups include people aged over 65 years who have one or more of the following risk factors: heavy smoking, a past history of smoking, a family history of lung cancer, prior thoracic radiation therapy, and exposure to carcinogens. Because early lung cancer lacks overt symptoms, patients are generally diagnosed in the middle and late stages (stage III or IV). However, a number of studies have shown that lung cancer patients diagnosed at an early stage have a higher survival rate: five-year survival for patients diagnosed at stage I is 13-fold higher than for patients diagnosed at stage IV. Therefore, early detection and diagnosis of lung tumors is crucial to improving the survival of lung cancer patients.
Detection of lung nodules by low-dose chest computed tomography (LDCT) is currently the most common diagnostic modality for discovering lung tumors. Surgical resection of lung nodules identified by imaging can reduce lung cancer mortality by 20%-39%. However, approximately 15%-35% of lung nodules judged to be high-risk on the initial LDCT image are ultimately found to be pathologically benign after surgical resection. Imaging therefore has certain limitations: diagnosing malignant lung tumors from imaging results alone leads to unnecessary operations, exposes patients to avoidable surgical risks and complications, and increases the burden of medical expenses. It is therefore important to determine whether lung nodules judged to be high-risk by imaging alone are benign or malignant.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in the prior art, there is no noninvasive means of determining whether lung nodules are benign or malignant, which leads to unnecessary operations and increases the burden on patients.
In the technical scheme of this patent, whole-genome sequencing (WGS) is performed on cfDNA from plasma samples, and fragment information is obtained from the high-throughput sequencing results. Five classes of features that differ between benign and malignant pulmonary nodules are analyzed: the DNA fragment size ratio (fragment size ratio), the proportion of sequence reads by 5'-end breakpoint motif (breakpoint motif), 1 Mb window copy number variation (1 Mb-bin copy number variation), 16 bp tumor new short sequences (16 bp neomers) and the nucleosome coverage pattern (nucleosome coverage patterns). Using automated machine learning with an elastic net logistic regression model (glm), extreme gradient boosting (xgboost), random forest and neural network algorithms, a multi-feature, multi-algorithm ensemble model is constructed to achieve accurate noninvasive diagnosis of malignant pulmonary nodules.
The specific technical scheme is as follows:
Use of gene markers in the preparation of a reagent for screening malignant pulmonary nodules;
the gene markers comprise:
a first marker: the number of short reads and the number of long reads among cfDNA fragments aligned to different windows of the reference genome;
a second marker: for cfDNA fragments aligned to the reference genome, the proportion of each kind of m-base fragment at the 5' end among all such base fragments;
a third marker: copy number in different windows on chromosomes in WGS data;
a fourth marker: tumor new short sequence proportion;
a fifth marker: nucleosome coverage pattern.
The first marker is obtained by the following steps: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads.
The second marker is obtained by the following steps: taking the m bases at the 5' end of each read as a base fragment, and obtaining the proportion of each kind of base fragment among all fragments.
The third marker is obtained by the following steps: dividing the reference genome into a plurality of windows, and obtaining copy number data of the WGS data in the different windows on chromosomes 1 to 22.
The fourth marker is obtained by the following steps:
generating, by exhaustive enumeration, the set A of all short sequences 16 bp in length; exhaustively collecting the set B of 16 bp short sequences present in the human reference genome sequence; and defining the sequences of set A that remain after removing those in set B as nullomers (sequences absent from the human reference genome);
obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting recurrent base substitution mutations; according to the positions of these base substitutions, finding among the nullomers the set C of nullomers containing the base substitutions;
obtaining base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of these base substitutions, finding among the nullomers the set D of nullomers containing the base substitutions; removing the nullomers of set D from set C, and defining the remaining nullomers as new short sequences (neomers);
counting the number of samples in which any new short sequence can be read; for each new short sequence, counting the number of samples containing it; and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read.
The fifth marker is obtained by the following steps:
obtaining transcription factors from the GTRD database, and excluding transcription factors that have no known transcription sites in the CIS-BP database;
taking the range from -5 kb to +5 kb around each transcription site of the obtained transcription factors as a window, obtaining the fragments 100-220 bp in length that align to these windows, and applying GC correction followed by sequencing-depth smoothing to the read data in the windows to obtain a coverage pattern curve for each transcription factor;
for each transcription factor, obtaining the following three features, which together constitute the nucleosome coverage pattern:
1) for all transcription sites of the transcription factor, the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) the amplitude of the trough of the coverage pattern curve, taken as the central depth of the transcription factor;
3) the amplitude of the highest peak of the nucleosome signal, obtained by applying a fast Fourier transform to the coverage pattern curve.
The method for constructing the malignant lung nodule screening model comprises the following steps:
step 1, extracting cfDNA from samples of a positive group and a control group and sequencing it to obtain read data;
step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads as a first feature set;
step 3, taking the m bases at the 5' end of each read as a base fragment, and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a third feature set;
step 5, taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
step 6, taking the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them as model feature vectors into the classification model with the benign or malignant status of the lung nodule as the output value, and training the model to obtain the early-screening model.
Step 2 comprises:
step 2-1, dividing the reference genome into a plurality of windows, and obtaining the number of long reads and the number of short reads within each window;
step 2-2, standardizing the short read counts and long read counts of all windows from step 2-1, and taking the ratio of the standardized short read count to the standardized long read count as the first feature value.
In step 2-1, the window size is 5 Mb, giving 541 windows.
The short reads are 100-150 bp in length, and the long reads are 151-220 bp.
In step 3, m is 4.
In step 4, the window size is 1 Mb, giving 2475 windows.
In step 5, the fourth feature set is obtained as follows:
step 5-1, generating, by exhaustive enumeration, the set A of all short sequences 16 bp in length; exhaustively collecting the set B of 16 bp short sequences present in the human reference genome sequence; and defining the sequences of set A that remain after removing those in set B as nullomers (sequences absent from the human reference genome);
step 5-2, obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting recurrent base substitution mutations; according to the positions of these base substitutions, finding among the nullomers the set C of nullomers containing the base substitutions;
step 5-3, obtaining base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of these base substitutions, finding among the nullomers the set D of nullomers containing the base substitutions; removing the nullomers of set D from set C, and defining the remaining nullomers as new short sequences (neomers);
step 5-4, counting the number of samples in which any new short sequence can be read; for each new short sequence, counting the number of samples containing it; and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read as the fourth feature set of the model.
The cancer database is the PCAWG database.
The different cancer types include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer.
Base substitution mutations in the East Asian population are obtained from the gnomAD database.
Step 6 comprises the following steps:
step 6-1, obtaining transcription factors from the GTRD database, and excluding transcription factors that have no known transcription sites in the CIS-BP database;
step 6-2, taking the range from -5 kb to +5 kb around each transcription site of the transcription factors obtained in step 6-1 as a window, obtaining the fragments 100-220 bp in length that align to these windows, and applying GC correction followed by sequencing-depth smoothing to the read data in the windows to obtain a coverage pattern curve for each transcription factor;
step 6-3, for each transcription factor, obtaining the following three features, which together constitute the nucleosome coverage pattern features:
1) for all transcription sites of the transcription factor, the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;
2) the amplitude of the trough of the coverage pattern curve, taken as the central depth of the transcription factor;
3) the amplitude of the highest peak of the nucleosome signal, obtained by applying a fast Fourier transform to the coverage pattern curve.
In step 7, the classification model is trained as follows:
step 7-1, inputting the first, second, third, fourth and fifth feature sets separately into different classifier models, training the models, and obtaining one or more optimal classifier models for each of the first, second, third, fourth and fifth feature sets;
step 7-2, performing second-level ensemble training on the optimal classifier models obtained in step 7-1 for the first, second, third, fourth and fifth feature sets to construct an ensemble classifier model.
The different classifier models are selected from elastic net regression (glm), extreme gradient boosting (xgboost), random forest, and deep learning neural network (deep learning, NN).
In the second-level ensemble training, a generalized linear model (GLM), an extreme gradient boosting (XGBoost) model or a deep learning regression model is used.
A malignant lung nodule detection apparatus, comprising:
a sequencing module, used for extracting and sequencing cfDNA from the samples of the positive group and the control group to obtain read data and WGS sequencing data;
a first feature acquisition module, used for aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of long reads as a first feature set;
a second feature acquisition module, used for taking the m bases at the 5' end of each read as a base fragment and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
a third feature acquisition module, used for dividing the reference genome into a plurality of windows and obtaining copy number data of the WGS data in the different windows on the chromosomes as a third feature set;
a fourth feature acquisition module, used for taking the ratio of the number of samples in which each 16 bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
a fifth feature acquisition module, used for analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
a prediction module, used for taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them as model feature vectors into the classification model with the benign or malignant status of the lung nodule as the output value, and training the model to obtain the early-screening model.
Drawings
FIG. 1 is a schematic diagram of the model building process;
FIG. 2 shows the distribution of differences for the feature with the largest contribution within each feature type;
FIG. 3 shows the AUC curves on the training set for models using each feature alone and for the model using all features;
FIG. 4 shows the AUC curve on the validation set for the model using all features;
FIG. 5 shows the prediction scores on the training set of the classifier obtained by ensembling all models;
FIG. 6 shows the prediction scores on the validation set of the classifier obtained by ensembling all models.
Detailed Description
The calculation method of the invention is detailed as follows:
The invention first requires extraction of cfDNA from blood samples, library construction and sequencing. The extraction and library construction methods are not particularly limited and can be adapted from methods in the prior art. The base information of the cfDNA can be obtained using sequencing techniques known in the art.
The purpose of the model in this patent is to distinguish between benign lung nodules and malignant lung nodules. The samples are classified as follows: among lung nodule patients judged to be high-risk by LDCT imaging, those whose nodules are determined to be benign by subsequent postoperative pathology form the control group, and those whose nodules are determined to be malignant form the positive group.
The data set conditions adopted in the model construction process of the invention are as follows:
[Table image: composition of the data sets used for model construction]
Extraction and sequencing of plasma cfDNA samples:
Before LDCT image diagnosis, a liquid biopsy is performed on the patient. A purple-top blood collection tube (EDTA anticoagulant tube) is used to collect 10 ml of whole blood from the patient, the plasma is separated by centrifugation promptly (within 2 hours), and the sample is stored frozen at -80 °C and transferred to the laboratory for analysis. After transfer to the laboratory, cfDNA is extracted from the plasma samples using a QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. A library is constructed from the collected cfDNA sample, and WGS is performed at 5× depth. After the raw sequencing data are obtained, they are aligned to the human reference genome to obtain the base information of each read.
The model building process of this patent mainly comprises the following steps:
step 2, extracting and sequencing cfDNA from the samples of the positive group and the control group to obtain read data;
step 3, aligning the read data to a reference genome, obtaining the number of reads in different length intervals within different windows of the reference genome, and taking the ratio of the numbers of reads of different lengths as a first feature value;
step 4, aligning the read data to a reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequence of the m bp upstream and m bp downstream of that position as a base fragment; and taking the proportion of each base fragment among all fragments as a second feature value;
step 5, dividing the reference genome into a plurality of windows, and obtaining the copy number data within each window as a third feature value;
step 6, taking the ratio of the reads of 16 bp new short sequences to the total reads as a fourth feature;
step 7, analyzing the nucleosome coverage pattern of the selected transcription factors as a fifth feature;
step 8, inputting the model feature vectors of the samples of the positive group and the control group into a first-layer model, and selecting for each feature the 5 models with the best AUC, so that 25 feature models in total are selected in the first layer;
step 9, inputting the 25 models selected in step 8 into a second-layer ensemble model, outputting the three ensemble models with the highest AUC, and taking the average of the predicted probabilities output by these three ensemble models as the final judgment result.
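The two selection steps at the end of this list (keeping the best models by AUC, then averaging the predictions of the top three ensemble models) can be sketched in Python as follows. This is only an illustrative sketch: the function names `auc`, `top_models` and `final_score`, the dictionary input format, and the rank-based AUC formula are assumptions made here, and the patent does not state on which data split the AUC is computed.

```python
def auc(labels, scores):
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_models(candidates, labels, k):
    """candidates: dict mapping a model name to its predicted probabilities.
    Returns the names of the k models with the highest AUC."""
    ranked = sorted(candidates,
                    key=lambda name: auc(labels, candidates[name]),
                    reverse=True)
    return ranked[:k]


def final_score(ensemble_preds, labels, k=3):
    """Average the predicted probabilities of the k best ensemble models."""
    best = top_models(ensemble_preds, labels, k)
    return [sum(ensemble_preds[m][i] for m in best) / len(best)
            for i in range(len(labels))]


# Toy data (hypothetical): 1 = malignant, 0 = benign
labels = [1, 1, 0, 0]
candidates = {
    "a": [0.9, 0.8, 0.2, 0.1],
    "b": [0.6, 0.4, 0.5, 0.3],
    "c": [0.1, 0.2, 0.8, 0.9],
}
selected = top_models(candidates, labels, k=2)
averaged = final_score(candidates, labels, k=2)
```

In the first layer the same `top_models` call would be applied once per feature type with k=5, which gives the 25 models described in step 8.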
The five feature types in this patent are detailed as follows:
1. DNA Fragment Size Ratio (FSR)
The DNA fragment size ratio reflects the length distribution of cfDNA reads in benign and malignant tumors. Machine learning is applied to the ratio of short DNA fragments to long DNA fragments to build a prediction model that distinguishes benign lung nodules from malignant lung nodules.
The cfDNA read length data are obtained as follows: the quality, length and alignment position of each read are recorded in the aligned BAM files, and the human reference genome used is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb each, the number of short reads (100 bp-150 bp) and the number of long reads (151 bp-220 bp) in each window are counted, and each count is standardized using the counts across all windows, that is, standardized value = (original value - mean) / standard deviation. This yields standardized counts of short and long reads for each of the 541 windows.
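The counting and standardization described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the input format (pairs of window index and fragment length), the use of the population standard deviation, and returning `None` when the standardized long-read count is zero are all assumptions made here.

```python
from statistics import mean, pstdev

SHORT = (100, 150)  # short-read length range in bp, inclusive
LONG = (151, 220)   # long-read length range in bp, inclusive


def zscores(values):
    """Standardize: (original value - mean) / standard deviation."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def fragment_size_ratio(fragments, n_windows):
    """fragments: iterable of (window_index, fragment_length) pairs.
    Returns raw short counts, raw long counts and the per-window ratio
    of standardized short to standardized long counts."""
    short = [0] * n_windows
    long_ = [0] * n_windows
    for win, length in fragments:
        if SHORT[0] <= length <= SHORT[1]:
            short[win] += 1
        elif LONG[0] <= length <= LONG[1]:
            long_[win] += 1
    zs, zl = zscores(short), zscores(long_)
    # The ratio is undefined when the standardized long count is exactly 0.
    ratios = [s / l if l != 0 else None for s, l in zip(zs, zl)]
    return short, long_, ratios


# Toy input (hypothetical): 3 windows; the 300 bp fragment is ignored.
fragments = [(0, 120), (0, 130), (0, 160), (1, 110),
             (1, 170), (1, 180), (1, 200), (2, 300)]
short, long_, ratios = fragment_size_ratio(fragments, n_windows=3)
```

With the patent's settings, `n_windows` would be 541 and the window index would come from dividing the aligned position by 5 Mb.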
2. Proportion of sequence reads by breakpoint motif at the 5' end of reads (Breakpoint Motifs, BPM)
Human genomic DNA has a double-helix structure held together by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the environment around the cells changes, so that the hydrogen bonds between complementary bases are destroyed and breaks occur. The base sequences at the breaks differ, and so do the proportions of reads carrying the different breakpoint sequences. The collection method is as follows: the base information and alignment position of each read are recorded in the aligned BAM file; the 4 bp sequences on either side of the breakpoint, that is, of the human reference genome coordinate where the 5' end of each read is located, are determined; the number of reads for each 8 bp sequence at the breakpoints (4^8 = 65536 in total) is counted; and the read proportion of each of the 65536 breakpoint sequences is calculated, for example, AAAAAAAA read proportion = number of AAAAAAAA reads / total number of breakpoint sequence reads.
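The motif counting described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name and arguments are hypothetical, the reference is a single contig held as a string, positions are 0-based coordinates of read 5' ends, and only forward-strand reads are handled (reverse-strand reads would need their motif reverse-complemented, which the text does not spell out).

```python
from collections import Counter
from itertools import product


def breakpoint_motif_proportions(reference, five_prime_positions, m=4):
    """Return the proportion of each 2*m-base motif (m bases upstream plus
    m bases downstream of the breakpoint) among all counted motifs."""
    counts = Counter()
    for pos in five_prime_positions:
        if pos - m < 0 or pos + m > len(reference):
            continue  # motif would run off the end of the contig
        motif = reference[pos - m:pos + m].upper()
        if set(motif) <= set("ACGT"):  # skip N and other ambiguity codes
            counts[motif] += 1
    total = sum(counts.values())
    all_motifs = ("".join(p) for p in product("ACGT", repeat=2 * m))
    return {k: (counts[k] / total if total else 0.0) for k in all_motifs}


# Toy reference (hypothetical) and four 5'-end positions; position 1 is
# too close to the contig start and is skipped.
reference = "A" * 9 + "CGTACGT"
proportions = breakpoint_motif_proportions(reference, [4, 5, 12, 1], m=4)
```

With m = 4 the returned dictionary always has 4^8 = 65536 entries, matching the size of the second feature set.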
3. 1 Mb Window Copy Number Variation (1 Mb-Bin Copy Number Variation, CNV)
Copy number changes are highly correlated with various cancers. Although cancers can be distinguished by detecting copy number changes in some cancer-associated genes or in particular genomic intervals, other rare or unknown genes or intervals can also provide information about potential copy number changes. The collection method is as follows: first, WGS data of 30 healthy people are collected, chromosomes 1-22 of the reference genome are divided into non-overlapping windows of 1 Mb, the read depth in each window is calculated for each sample using bedtools coverage and corrected according to the GC content and average mappability (UCSC BigWig file) of each window, and the median depth of the 30 healthy people in each window is taken as representative, giving a population baseline of read depths for 2475 windows. For each sample to be tested, the read depths of the 2475 windows are obtained in the same way, and the log copy number change of each window, namely log2(corrected and normalized depth of the sample to be tested / corrected and normalized depth of the population baseline), is constructed using a Hidden Markov Model (HMM) and the population baseline depth of each window, thereby obtaining the copy number change information of each sample to be tested.
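The baseline and log2-ratio arithmetic described above can be sketched in Python as follows. This sketch covers only the depth ratio: it assumes the window depths have already been GC- and mappability-corrected, it normalizes each profile by its own median (an assumption about what "normalized" means here), and it omits the HMM step entirely. All function names are hypothetical.

```python
from math import log2
from statistics import median


def baseline_depth(healthy_depths):
    """healthy_depths: one list of corrected window depths per healthy sample.
    Returns the per-window median across samples (the population baseline)."""
    return [median(column) for column in zip(*healthy_depths)]


def normalize(depths):
    """Scale a depth profile so that its median window depth equals 1."""
    med = median(depths)
    return [d / med for d in depths]


def log2_ratios(sample_depths, baseline):
    """Per-window log2(normalized sample depth / normalized baseline depth).
    Windows with zero depth in either profile are reported as None."""
    s, b = normalize(sample_depths), normalize(baseline)
    return [log2(x / y) if x > 0 and y > 0 else None for x, y in zip(s, b)]


# Toy data (hypothetical): 3 healthy samples, 4 windows; the test sample
# has doubled depth in the third window.
healthy = [[10, 10, 10, 10], [12, 12, 12, 12], [8, 8, 8, 8]]
baseline = baseline_depth(healthy)
ratios = log2_ratios([20, 20, 40, 20], baseline)
```

A log2 ratio of 1 in a window corresponds to a doubling of depth relative to the baseline, and a ratio of 0 to no change.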
4. 16 bp tumor new short sequence (16 bp Neomers, NEO)
A nullomer is a short DNA sequence that is not present in the human genome. The 16 bp tumor new short sequences (neomers) are a subset of nullomers: nullomers 16 bp in length that are not present in the human genome but are repeatedly found in the genomes of tumor tissue.
In this patent, the feature value is obtained as follows:
First, exhaustive enumeration is used to generate the set A of all possible short sequences 16 bp in length. In the human reference genome sequence (hg19 version), a sliding window with a step of 1 bp is used to exhaustively find the set B of all 16 bp short sequences and their numbers of occurrences, and the 16 bp short sequences that appear in set A but not in set B are defined as nullomers.
By analyzing the PCAWG database (https://dcc.icgc.org/pcawg), the WGS mutation results of 2577 patients with 6 different types of cancer (intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer, liver cancer) are obtained, and 977 recurrent base substitutions (occurring at least twice) are extracted from them. According to the positions of these base substitutions, the set C of all possible nullomers containing the base substitutions is extracted by exhaustive enumeration.
From the gnomAD database (https://gnomad.broadinstitute.org/), base substitution sites with a frequency greater than 0.01 in the East Asian population are collected, and according to their positions, the set D of nullomers containing these sites is found among the nullomers and taken as the set of nullomers common in the East Asian population. The nullomers of set D are removed from set C, giving 4616 new 16 bp tumor-related short sequences. For these 4616 new short sequences, the number of samples in which any of the 4616 new short sequences can be read is counted first; then, for each new short sequence, the number of samples containing it is counted, and the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read (4616 ratio values in total) is used as the fourth feature of the model.
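The set operations described above (nullomers as A minus B, then C minus D, then per-sequence sample fractions) can be sketched in Python as follows. This is a toy sketch under stated assumptions: the function names are hypothetical, a short string stands in for the reference genome, k = 3 is used in the example instead of 16, and substitutions are given as (0-based position, alternate base) pairs. Rather than enumerating all of set A, a k-mer is treated as a nullomer when it is absent from the reference k-mer set B, which gives the same result as A minus B.

```python
def genome_kmers(seq, k):
    """Set B: every k-mer present in the reference, with a 1 bp sliding step."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmers_covering(seq, pos, alt, k):
    """All k-mers that span position pos after substituting base alt there."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    start = max(0, pos - k + 1)
    end = min(pos, len(seq) - k)
    return {mutated[i:i + k] for i in range(start, end + 1)}


def neomers(seq, tumor_subs, population_subs, k=16):
    """Nullomers created by tumor substitutions (set C) minus those created
    by common population substitutions (set D)."""
    present = genome_kmers(seq, k)

    def nullomers_from(subs):
        found = set()
        for pos, alt in subs:
            found |= {km for km in kmers_covering(seq, pos, alt, k)
                      if km not in present}
        return found

    return nullomers_from(tumor_subs) - nullomers_from(population_subs)


def sample_fractions(neomer_set, sample_kmers):
    """sample_kmers: one set of observed k-mers per sample. For each neomer,
    return (samples containing it) / (samples containing any neomer)."""
    hits = [observed & neomer_set for observed in sample_kmers]
    n_any = sum(1 for h in hits if h)
    return {n: (sum(1 for h in hits if n in h) / n_any if n_any else 0.0)
            for n in sorted(neomer_set)}


# Toy example (hypothetical data), k = 3
reference = "ACGTACGT"
found = neomers(reference, tumor_subs=[(3, "G"), (5, "A")],
                population_subs=[(3, "G")], k=3)
fractions = sample_fractions(found, [{"TAA", "ACG"}, {"TAA", "AGT"}, {"ACG"}])
```

At k = 16 on a whole genome the reference k-mer set is very large, so a practical implementation would likely need a more compact index than a Python set; that choice is outside what the text specifies.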
5. Nucleosome coverage pattern (NCP)
Transcription factors are selected from the GTRD database (https://gtrd.biouml.org/#!) (v21.12); transcription factors without known transcription sites in the CIS-BP database (http://cisbp.ccbr.utoronto.ca/) (v2.00) are excluded, and 334 transcription factors with more than 10000 high-matching sites are retained.
For the transcription factors obtained above, the range from -5kb to +5kb around each transcription site of these target transcription factors is taken as a window, and fragments 100-220bp in length that can be aligned into these windows are obtained. GC correction is applied to these fragments, and the sequencing depth is smoothed with a Savitzky-Golay filter (polynomial order 3) to obtain the final coverage pattern curve of each transcription factor.
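The smoothing step can be sketched as follows. The patent gives only the polynomial order (3); the 5-point window, the untouched edge positions and the name `smooth_depth` are assumptions made for illustration. The weights (-3, 12, 17, 12, -3)/35 are the standard Savitzky-Golay convolution coefficients for a 5-point cubic fit.

```python
# Savitzky-Golay convolution weights for a 5-point window and a cubic
# (order 3) polynomial. The window length is an assumption; the patent
# only specifies the polynomial order.
SG_WEIGHTS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORM = 35.0

def smooth_depth(depth):
    """Return a smoothed copy of a per-position sequencing depth list."""
    half = len(SG_WEIGHTS) // 2
    smoothed = list(depth)  # edge positions are left unsmoothed (assumption)
    for i in range(half, len(depth) - half):
        window = depth[i - half:i + half + 1]
        smoothed[i] = sum(w * v for w, v in zip(SG_WEIGHTS, window)) / SG_NORM
    return smoothed
```

In practice a library routine such as `scipy.signal.savgol_filter(depth, window_length, polyorder=3)` would normally be used, with a window length chosen for the data.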
After the coverage pattern curve described above is obtained, three features are extracted for each transcription factor:
1) The average depth from 1kb upstream to 1kb downstream, taken over all transcription sites of the transcription factor;
2) The center depth of the transcription factor;
3) The amplitude of the highest peak of the nucleosome amplitude signal, obtained by performing a fast Fourier transform on the coverage pattern curve.
These three features are taken together as the feature values of the nucleosome coverage pattern.
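The three features listed above can be sketched on a single coverage pattern curve as follows. This is an illustration under assumptions: the curve is a plain list indexed by position, the center depth is read at the site position itself, and the spectrum is computed with a direct discrete Fourier transform normalized by the curve length. All function names are hypothetical.

```python
import cmath

def mean_depth(curve, center, flank):
    """Feature 1: average depth from `flank` positions upstream to downstream."""
    window = curve[center - flank:center + flank + 1]
    return sum(window) / len(window)

def central_depth(curve, center):
    """Feature 2: depth at the center of the window (assumed definition)."""
    return curve[center]

def peak_fft_amplitude(curve):
    """Feature 3: largest non-zero-frequency amplitude of the Fourier spectrum."""
    n = len(curve)
    mean = sum(curve) / n
    centered = [v - mean for v in curve]  # remove the constant component
    best = 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        best = max(best, abs(coeff) / n)
    return best
```

The direct transform is quadratic in the curve length; for 10kb windows a fast routine such as `numpy.fft.rfft` would be the usual choice.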
The data acquisition above yields an initial data vector for each of the five types of data. A corresponding calculation method is then designed. In the patent, a conventional classifier algorithm can be used: the feature values are input into a classifier, and the probability of a malignant lung nodule is the output. Four types of classifier model are used in the patent. When a classifier is optimized, sub-models with different model parameters are generated under the same model type, and the sub-models are then screened. The four main models are:
1. Elastic network regression model (glm)
The elastic network regression model is a common machine learning algorithm. It fits a generalized linear model by penalized maximum likelihood and combines the L2 regularization of ridge regression with the L1 regularization of LASSO regression. The regularization path is computed for the lasso or elastic network penalty over a grid of values of the regularization parameter λ, which addresses the over-fitting problem in regression. The hyper-parameter alpha controls the balance between the L1 and L2 regularization.
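The patent does not write out the objective. As an illustration, in the convention commonly used by glmnet-style solvers (an assumption here), the penalized maximum likelihood problem for N samples with features x_i, labels y_i, intercept β_0, coefficients β and per-sample log-likelihood ℓ can be written as:

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N} \ell\!\left(y_i,\; \beta_0 + x_i^{\top}\beta\right)
  \;+\; \lambda\left[\,\alpha\,\lVert\beta\rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}\right]
```

With alpha = 1 the penalty reduces to the LASSO (L1) penalty, and with alpha = 0 to the ridge (L2) penalty; λ sets the overall strength of the regularization.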
2. Extreme gradient boosting (xgboost)
Extreme gradient boosting is an optimized implementation of the additive ensemble model based on the Gradient Boosting Decision Tree (GBDT). It expands the loss function with a second-order Taylor formula to optimize it and improve calculation accuracy, simplifies the model with a regularization term to avoid overfitting, and supports parallel computation through its block storage structure.
The base learner of the XGBoost used in this patent is a tree model. As the depth of a tree increases, its complexity increases; the model can be trained to fit better, but overfitting may also result. The hyperparameter max_depth controls the maximum depth of a tree, and the hyperparameter min_rows controls the minimum number of samples in each leaf node.
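The second-order Taylor expansion mentioned above for extreme gradient boosting can be written out as follows. The notation is the one usually used for XGBoost and is not taken from the patent: at boosting round t a new tree f_t is added, l is the loss, T is the number of leaves of the tree, w its leaf weights, and γ and λ are regularization constants.

```latex
\mathcal{L}^{(t)} \;\approx\; \sum_{i=1}^{n}\left[\, l\!\left(y_i,\hat{y}_i^{(t-1)}\right)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^{2} \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2},

g_i = \frac{\partial\, l\!\left(y_i,\hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\qquad
h_i = \frac{\partial^{2} l\!\left(y_i,\hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^{2}}
```

The first-order terms g_i and second-order terms h_i are what each new tree is fitted to, and Ω is the regularization term that penalizes tree complexity.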
3. Random forest (random forest)
Random forests are a powerful classification and regression tool for high-dimensional data and multicollinearity. Given a data set, a random forest randomly draws part of the information to generate a forest of decision trees for classification or regression; at each node a splitting attribute is chosen, and the random drawing is repeated until no further split is possible. Finally, the results of all splits are combined to give the final prediction. The random forest also controls the complexity of the trees through hyperparameters such as max_depth and min_rows.
4. Deep learning neural network model (deep learning, NN)
A neural network consists of inputs, weights, biases or thresholds, and outputs. If the output of any single node is above a specified threshold, that node is activated and sends data to the next layer of the network. Each node of the hidden layer is computed from the nodes of the input layer by a weighted sum followed by an activation function, and the output layer is computed from the hidden-layer values in the same way. The method offers high classification accuracy, strong parallel distributed processing capability, and strong distributed storage and learning capabilities.
The deep learning neural network model used in the patent is a multi-layer feedforward neural network. Its structure comprises an input layer at the front, hidden layers in the middle and an output layer at the end; the middle may contain several hidden layers. Network signals are transmitted in one direction from the input layer to the output layer, and the neurons of each layer are connected only to the neurons of the previous layer, from which they receive information. The network is trained with a stochastic gradient descent algorithm, which estimates the error gradient of the current model state from the data in the training set and then uses it to update the model weights. The learning rate (epsilon), which sets the size of the weight updates during training and thus how quickly the model adapts to the problem, and the learning rate decay index (rho), which also affects the learning rate of the model, are both important configurable parameters in neural network training.
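The one-way, layer-by-layer computation described above can be sketched as a forward pass. This is a minimal illustration with hypothetical names (`dense`, `forward`) and made-up weights; it does not reproduce the patent's network, its activation functions or its training procedure.

```python
import math

def sigmoid(z):
    """Logistic activation, used here for the output probability."""
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the previous layer plus bias, then activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Propagate inputs through (weights, biases, activation) layers in order."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values
```

For example, a network with one hidden layer of two tanh units and a single sigmoid output is `forward(x, [(W1, b1, math.tanh), (W2, b2, sigmoid)])`, where each row of a weight matrix holds the incoming weights of one neuron.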
In addition, the patent adopts a random grid search of parameters (Random Grid Search Parameters) to optimize the models.
Random search is a common method of hyperparameter optimization in machine learning. It randomly draws parameter values within a specified range of model parameters and selects the best parameter combination from the sampled values. Rather than trying all possible combinations, it evaluates a fixed number of random combinations, each with a randomly chosen value for every hyper-parameter. Compared with manual tuning and grid search, random search can achieve good results with fewer trials and is more efficient, especially when there are many parameters.
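Random search as described above can be sketched in a few lines. The search space, the scoring function and the name `random_search` are assumptions for illustration; in the patent's setting the score would be the AUC of a trained sub-model.

```python
import random

def random_search(space, evaluate, n_trials, seed=0):
    """Sample n_trials random hyper-parameter combinations and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw one random value for every hyper-parameter.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `random_search({"max_depth": [2, 4, 6, 8], "min_rows": [1, 5, 10]}, score_fn, n_trials=20)` evaluates at most 20 combinations, where `score_fn` is a hypothetical function that trains a sub-model with the given parameters and returns its score.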
In the implementation of the four algorithms in the patent, the following … (algorithm type, or the algorithm toolkit that is called) is specifically adopted.
The hyper-parameters of the four algorithms used in this patent are shown in the following table:
Figure 971013DEST_PATH_IMAGE002
After the five types of initial data are acquired for 247 patients with malignant lung nodules and 60 patients with benign lung nodules, the Fragment Size Ratio (FSR) statistics are used as input (the input vector of each sample consists of 541 fragment size ratio values), and the malignant and benign lung nodule samples are classified by each of the four classification models. During screening, the parameters and structure of each of the four model types are varied by random hyper-parameter search; each resulting configuration is trained as a sub-model, and the five best sub-models for the feature are selected, using the AUC on the training set as the measure of classification performance. Similarly, after the proportions of 5' end breakpoint sequence reads of DNA fragments are collected for the benign and malignant lung nodule patients, these proportions (65536 kinds) are used as input to the four classification models to classify the malignant and benign samples, and the five best sub-models for this feature are selected (model optimization and hyper-parameter adjustment as above). Likewise, copy number variation (2475), new short sequences (4616) and nucleosome coverage patterns (1002, that is, 3 features for each of the 334 transcription factors) are each used as input to the four model types, and the five best sub-models are selected for each feature (model optimization as above). This calculation process yields a total of 5 × 5 = 25 models.
In each calculation, the contribution of each feature variable to the classification result can be obtained.
The 5 optimal models selected for each feature (25 models in total) are shown in the following tables:
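The selection of the best sub-models by AUC can be sketched as follows. The rank-based AUC below is the standard pairwise definition; the data layout (a dictionary from sub-model name to predicted probabilities) and the function names are assumptions for illustration.

```python
def auc(labels, scores):
    """Probability that a malignant sample (label 1) outscores a benign one (label 0).

    Ties between a malignant and a benign score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

def select_top_submodels(candidates, labels, top_n=5):
    """Rank candidate sub-models by AUC and keep the best top_n names."""
    ranked = sorted(
        candidates,
        key=lambda name: auc(labels, candidates[name]),
        reverse=True,
    )
    return ranked[:top_n]
```

Applied once per feature type with `top_n=5`, this gives five sub-models per feature and 25 in total, matching the count stated above.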
Figure 203412DEST_PATH_IMAGE003
Figure 709479DEST_PATH_IMAGE004
The contribution values and feature-variable rankings of the optimal model selected for each feature are as follows:
1. DNA fragment size ratio (FSR), deep learning neural network model (Deep Learning, NN):
Figure 378227DEST_PATH_IMAGE005
Figure 995153DEST_PATH_IMAGE006
2. 5' end breakpoint sequence reads (BPM), elastic network regression model (GLM):
Figure 82058DEST_PATH_IMAGE007
Figure 759027DEST_PATH_IMAGE008
3. Copy number variation (CNV), deep learning neural network model (Deep Learning, NN):
Figure 931382DEST_PATH_IMAGE009
Figure 86420DEST_PATH_IMAGE010
4. 16bp tumor new short sequence (NEO), XGBoost model:
Figure 293410DEST_PATH_IMAGE011
Figure 390548DEST_PATH_IMAGE012
5. Nucleosome coverage pattern (NCP), XGBoost model:
Figure 50200DEST_PATH_IMAGE013
Figure 743349DEST_PATH_IMAGE014
To further improve the prediction performance of the classifier, secondary ensemble training (stacking) is performed on the results of the 25 trained models. Stacking is an ensemble learning technique in which the outputs of the 25 low-level classifiers (1st-level base models) are used for a second round of meta-learning (2nd-level meta-learning); it gathers the strengths of each base classifier and finds an optimal way to integrate them, thereby improving the prediction performance of the model. The training algorithms used for stacking are a Generalized Linear Model (GLM), an extreme gradient boosting (XGBoost) model and a deep learning regression model.
The 3 optimal ensemble models and the importance of each feature model (variable importance) are shown in the following tables:
Figure 539267DEST_PATH_IMAGE015
Figure 558038DEST_PATH_IMAGE016
The optimal stacking model, with the highest AUC, is the Generalized Linear Model (GLM). It relates the expected value of the response variable to a linear combination of the predictor variables through a link function, and converts the 25 trained models into a final linear equation: ALL Stacked = Intercept + A × CNV model 1 + B × CNV model 2 + C × CNV model 3 + D × CNV model 4 + E × CNV model 5 + F × BPM model 1 + G × BPM model 2 + H × BPM model 3 + I × BPM model 4 + J × BPM model 5 + K × FSR model 1 + L × FSR model 2 + M × FSR model 3 + N × FSR model 4 + O × FSR model 5 + P × NEO model 1 + Q × NEO model 2 + R × NEO model 3 + S × NEO model 4 + T × NEO model 5 + U × NCP model 1 + V × NCP model 2 + W × NCP model 3 + X × NCP model 4 + Y × NCP model 5.
The specific coefficients are as follows:
Figure 704986DEST_PATH_IMAGE017
Figure 467406DEST_PATH_IMAGE018
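The stacked linear equation can be evaluated as in the following sketch. It assumes a logistic (logit) link to turn the linear score into a probability, which the patent does not state explicitly, and the coefficient values in any call are placeholders, not the coefficients of the patent's tables.

```python
import math

def stacked_probability(intercept, coefficients, base_outputs):
    """Combine base-model outputs linearly, then apply a logistic link.

    coefficients and base_outputs are dictionaries keyed by base-model name,
    for example "CNV model 1" or "FSR model 3".
    """
    linear = intercept + sum(
        coefficients[name] * base_outputs[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear))
```

Each of the 25 base models contributes one term, so the dictionaries would hold 25 entries in the full model; base models with a coefficient of zero do not affect the result.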
Each feature has some predictive power under the different training algorithms, and secondary ensemble training over the single-feature models further improves prediction. In the end, the AUC of the predictions reaches 0.9474 on the training set and 0.931 on the verification set, with a sensitivity of 85% and a specificity of 98.7%.
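For reference, sensitivity and specificity of the kind reported above are computed from predicted probabilities and a decision cutoff, as in this minimal sketch. The cutoff and the function name are assumptions; the patent does not state the cutoff used here.

```python
def sensitivity_specificity(labels, probabilities, cutoff):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    labels use 1 for malignant and 0 for benign; a sample is called
    malignant when its probability is at least the cutoff.
    """
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in pairs if y == 1 and p < cutoff)
    tn = sum(1 for y, p in pairs if y == 0 and p < cutoff)
    fp = sum(1 for y, p in pairs if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)
```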

Claims (8)

1. Use of a gene marker in the preparation of a reagent for screening malignant pulmonary nodules, characterized in that the gene marker comprises:
a first marker: the number of short reads and the number of long reads of cfDNA fragments aligned to different windows of the reference genome;
a second marker: for cfDNA fragments aligned to the reference genome, the proportion of each kind of m-base fragment at the 5' end among all base fragments;
a third marker: the copy number in different windows on the chromosomes in WGS data;
a fourth marker: the proportion of new tumor short sequences;
a fifth marker: the nucleosome coverage pattern;
the fourth marker is obtained by the following steps:
generating, by an exhaustive method, a set A of short sequences 16bp in length; exhaustively finding the set B of all short sequences 16bp in length in the human reference gene sequence, and defining the sequences that remain after the data of set B are removed from set A as invalid seeds;
obtaining WGS sequencing results of samples of different cancer species from a cancer database, and extracting base substitution mutations that appear multiple times; according to the positions of the base substitutions, finding among the invalid seeds a set C of invalid seed sequences containing the base substitutions;
obtaining base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid seeds a set D of invalid seed sequences containing the base substitutions; removing the invalid seed sequences of set D from set C, and defining the remainder as new short sequences;
counting the number of samples in which any new short sequence can be read, counting, for each new short sequence, the number of samples containing that new short sequence, and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read;
the fifth marker is obtained by the following steps:
obtaining transcription factors from the GTRD database, and excluding the transcription factors that do not have known transcription sites in the CIS-BP database;
taking the range of -5kb to +5kb around the transcription sites of the obtained transcription factors as windows, obtaining fragments 100-220bp in length that can be aligned to the windows, and sequentially performing GC correction and sequencing depth smoothing on the read data in the windows to obtain a coverage pattern curve of each transcription factor;
for each transcription factor, the following three features are obtained, which together form the nucleosome coverage pattern:
1) for all transcription sites of the transcription factor, calculating the average depth from 1kb upstream to 1kb downstream of the transcription sites;
2) from the obtained coverage pattern curve, obtaining the amplitude value of the curve trough as the center depth of the transcription factor;
3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest peak of the nucleosome amplitude signal.
2. The use according to claim 1, wherein the first marker is obtained by: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of ultra-long reads;
the second marker is obtained by: taking the m bases at the 5' end of the read data as a set of base fragments, and obtaining the proportion of each kind of base fragment among all fragments;
the third marker is obtained by: dividing the reference genome into a plurality of windows, and obtaining the copy number data in the different windows on chromosomes 1 to 22 in the WGS data.
3. A method for constructing a malignant lung nodule screening model, characterized by comprising the following steps:
step 1, extracting and sequencing cfDNA from samples of a positive group and a control group to obtain read data;
step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of ultra-long reads as a first feature set;
step 3, taking the m bases at the 5' end of the read data as a set of base fragments, and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
step 4, dividing the reference genome into a plurality of windows, and obtaining the copy number data of the WGS data in the different windows on the chromosomes as a third feature set;
step 5, taking the ratio of the number of samples in which a 16bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
step 6, analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
step 7, taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into a classification model as model feature vectors, and training the model with benign or malignant lung nodule status as the output value to obtain an early screening model;
the fourth feature set is obtained as follows:
step 5-1, generating, by an exhaustive method, a set A of short sequences 16bp in length; exhaustively finding the set B of all short sequences 16bp in length in the human reference gene sequence, and defining the sequences that remain after the data of set B are removed from set A as invalid seeds;
step 5-2, obtaining WGS sequencing results of samples of different cancer species from a cancer database, and extracting base substitution mutations that appear multiple times; according to the positions of the base substitutions, finding among the invalid seeds a set C of invalid seed sequences containing the base substitutions;
step 5-3, obtaining base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid seeds a set D of invalid seed sequences containing the base substitutions; removing the invalid seed sequences of set D from set C, and defining the remainder as new short sequences;
step 5-4, counting the number of samples in which any new short sequence can be read, counting, for each new short sequence, the number of samples containing that new short sequence, and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read as the fourth feature set of the model;
the cancer database is the PCAWG database;
the different cancer species are intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer;
the base substitution mutations in the East Asian population are obtained from the gnomAD database;
step 6 comprises the following steps:
step 6-1, obtaining transcription factors from the GTRD database, and excluding the transcription factors that do not have known transcription sites in the CIS-BP database;
step 6-2, taking the range of -5kb to +5kb around the transcription sites of the transcription factors obtained in step 6-1 as windows, obtaining fragments 100-220bp in length that can be aligned to the windows, and sequentially performing GC correction and sequencing depth smoothing on the read data in the windows to obtain a coverage pattern curve of each transcription factor;
step 6-3, for each transcription factor, obtaining the following three features, which together form the nucleosome coverage pattern features:
1) for all transcription sites of the transcription factor, calculating the average depth from 1kb upstream to 1kb downstream of the transcription sites;
2) from the obtained coverage pattern curve, obtaining the amplitude value of the curve trough as the center depth of the transcription factor;
3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest peak of the nucleosome amplitude signal.
4. The method for constructing a malignant lung nodule screening model according to claim 3, wherein step 3 comprises: step 3-1, dividing the reference genome into a plurality of windows, and obtaining the number of long reads and the number of short reads within each window; step 3-2, standardizing the numbers of short reads and long reads of all windows from step 3-1, and taking the ratio of the standardized number of short reads to the standardized number of long reads as a first feature value.
5. The method according to claim 4, wherein the window size in step 3-1 is 5Mb, and 541 windows are defined.
6. The method according to claim 3, wherein the short reads are reads 100-150bp in length and the long reads are reads 151-220bp in length;
in step 3, m is 4;
in step 4, the window size is 1Mb, and 2475 windows are defined.
7. The method according to claim 3, wherein, in step 7, the classification model is obtained by:
step 7-1, inputting the first, second, third, fourth and fifth feature sets separately into different classifier models, training the models, and obtaining one or more optimal classifier models for each of the first, second, third, fourth and fifth feature sets;
step 7-2, performing secondary ensemble training on the optimal classifier models of the first, second, third, fourth and fifth feature sets obtained in step 7-1 to construct an integrated classifier model;
the different classifier models are selected from an elastic network regression model, an extreme gradient boosting model, a random forest model and a deep learning neural network model; a generalized linear model, an extreme gradient boosting (XGBoost) model or a deep learning regression model is used in the secondary ensemble training.
8. A malignant lung nodule detection device, comprising:
a sequencing module, used for extracting and sequencing cfDNA from samples of a positive group and a control group to obtain read data and WGS sequencing data;
a first feature acquisition module, used for aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining, for each window, the ratio of the number of short reads to the number of ultra-long reads as a first feature set;
a second feature acquisition module, used for taking the m bases at the 5' end of the read data as a set of base fragments, and obtaining the proportion of each kind of base fragment among all fragments as a second feature set;
a third feature acquisition module, used for dividing the reference genome into a plurality of windows, and obtaining the copy number data of the WGS data in the different windows on the chromosomes as a third feature set;
a fourth feature acquisition module, used for taking the ratio of the number of samples in which a 16bp new short sequence is read to the total number of samples in which any new short sequence can be read as a fourth feature set;
a fifth feature acquisition module, used for analyzing the nucleosome coverage pattern features of the selected transcription factors as a fifth feature set;
a prediction module, used for taking the first, second, third, fourth and fifth feature sets together as initial feature values, inputting them into a classification model as model feature vectors, and training the model with benign or malignant lung nodule status as the output value to obtain an early screening model;
the fourth feature set is obtained as follows:
step 5-1, generating, by an exhaustive method, a set A of short sequences 16bp in length; exhaustively finding the set B of all short sequences 16bp in length in the human reference gene sequence, and defining the sequences that remain after the data of set B are removed from set A as invalid seeds;
step 5-2, obtaining WGS sequencing results of samples of different cancer species from a cancer database, and extracting base substitution mutations that appear multiple times; according to the positions of the base substitutions, finding among the invalid seeds a set C of invalid seed sequences containing the base substitutions;
step 5-3, obtaining base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid seeds a set D of invalid seed sequences containing the base substitutions; removing the invalid seed sequences of set D from set C, and defining the remainder as new short sequences;
step 5-4, counting the number of samples in which any new short sequence can be read, counting, for each new short sequence, the number of samples containing that new short sequence, and taking the ratio of the number of samples containing each new short sequence to the total number of samples in which any new short sequence can be read as the fourth feature set of the model;
the cancer database is the PCAWG database;
the different cancer species include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer;
the base substitution mutations in the East Asian population are obtained from the gnomAD database;
step 6 comprises the following steps:
step 6-1, obtaining transcription factors from the GTRD database, and excluding the transcription factors that do not have known transcription sites in the CIS-BP database;
step 6-2, taking the range of -5kb to +5kb around the transcription sites of the transcription factors obtained in step 6-1 as windows, obtaining fragments 100-220bp in length that can be aligned to the windows, and sequentially performing GC correction and sequencing depth smoothing on the read data in the windows to obtain a coverage pattern curve of each transcription factor;
step 6-3, for each transcription factor, obtaining the following three features, which together form the nucleosome coverage pattern features:
1) for all transcription sites of the transcription factor, calculating the average depth from 1kb upstream to 1kb downstream of the transcription sites;
2) from the obtained coverage pattern curve, obtaining the amplitude value of the curve trough as the center depth of the transcription factor;
3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest peak of the nucleosome amplitude signal.
CN202211220583.XA 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device Active CN115295074B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211453553.3A CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device
CN202211220583.XA CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211220583.XA CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202211453553.3A Division CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Publications (2)

Publication Number Publication Date
CN115295074A CN115295074A (en) 2022-11-04
CN115295074B true CN115295074B (en) 2022-12-16

Family

ID=83834944

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202211220583.XA Active CN115295074B (en) 2022-10-08 2022-10-08 Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
CN202211453553.3A Pending CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202211453553.3A Pending CN116052768A (en) 2022-10-08 2022-10-08 Malignant lung nodule screening gene marker, construction method of screening model and detection device

Country Status (1)

Country Link
CN (2) CN115295074B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115678999B (en) * 2022-12-30 2023-05-26 南京世和基因生物技术股份有限公司 Application of marker in lung cancer recurrence prediction and prediction model construction method
CN115984629B (en) * 2023-02-14 2024-02-02 成都泰莱生物科技有限公司 Lung nodule classification method and product based on fusion of lung CT and 5mC marker
CN117334249A (en) * 2023-05-30 2024-01-02 上海品峰医疗科技有限公司 Method, apparatus and medium for detecting copy number variation based on amplicon sequencing data
CN117352064B (en) * 2023-12-05 2024-02-09 成都泰莱生物科技有限公司 Lung cancer metabolic marker combination and screening method and application thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362897A (en) * 2020-03-06 2021-09-07 福建和瑞基因科技有限公司 Tumor marker screening method based on nucleosome distribution characteristics and application

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2971152B1 (en) * 2013-03-15 2018-08-01 The Board Of Trustees Of The Leland Stanford Junior University Identification and use of circulating nucleic acid tumor markers
BR112019000296A2 (en) * 2016-07-06 2019-04-16 Guardant Health, Inc. methods for cell free nucleic acid fragmentome profiling
CN108753790B (en) * 2018-06-12 2020-12-11 北京市神经外科研究所 BAVM-associated gene markers and mutations thereof
CN110106244A (en) * 2019-06-06 2019-08-09 广州市雄基生物信息技术有限公司 A kind of noninvasive molecule parting kit of breast cancer and method
CN116157868A (en) * 2020-08-18 2023-05-23 德尔菲诊断公司 Methods and systems for free DNA fragment size density to assess cancer
CN113903398A (en) * 2021-09-08 2022-01-07 南京世和基因生物技术股份有限公司 Intestinal cancer early-screening marker, detection method, detection device, and computer-readable medium


Also Published As

Publication number Publication date
CN115295074A (en) 2022-11-04
CN116052768A (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN115295074B (en) Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device
CN112086129B (en) Method and system for predicting cfDNA of tumor tissue
US10489550B2 (en) Predictive test for aggressiveness or indolence of prostate cancer from mass spectrometry of blood-based sample
CN109801680B (en) Tumor metastasis and recurrence prediction method and system based on TCGA database
CN111243673B (en) Tumor screening model, and construction method and device thereof
CN111276252B (en) Construction method and device of tumor benign and malignant identification model
WO2023197825A1 (en) Multi-cancer early screening model construction method and detection device
CN114783524B (en) Path abnormity detection system based on self-adaptive resampling depth encoder network
CN112289376B (en) Method and device for detecting somatic cell mutation
CN108537005A (en) Key lncRNA prediction method based on BPSO-KNN model
CN113851185A (en) Prognosis evaluation method for immunotherapy of non-small cell lung cancer patient
CN115896242A (en) Intelligent cancer screening model and method based on peripheral blood immune characteristics
CN114613430A (en) Filtering method and computing equipment for false positive nucleotide variation sites
CN116153420B (en) Application of gene marker in early screening of malignant breast cancer and benign breast nodule and construction method of screening model
CN111944902A (en) Early prediction method of renal papillary cell carcinoma based on lincRNA expression profile combination characteristics
CN111944900A (en) Characteristic lincRNA expression profile combination and early endometrial cancer prediction method
CN111748634A (en) Characteristic lincRNA expression profile combination and early prediction method of colon cancer
CN116312800A (en) Lung cancer characteristic identification method, device and storage medium based on circulating RNA whole transcriptome sequencing in blood plasma
CN110942808A (en) Prognosis prediction method and prediction system based on gene big data
KR20200109544A (en) Multi-cancer classification method by common significant genes
KR20220133516A (en) Method for detecting tumor derived mutation from cell-free DNA based on artificial intelligence and Method for early diagnosis of cancer using the same
Akter et al. A data mining approach for biomarker discovery using transcriptomics in endometriosis
KR20220160805A (en) Method for early diagnosis of cancer using cell-free DNA by modeling tissue-specific chromatin structure based on Artificial intelligence
CN114664451A (en) Model for predicting postoperative discharge readiness of rectal cancer patient
CN111733252A (en) Characteristic miRNA expression profile combination and early gastric cancer prediction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant