CN115295074B

CN115295074B - Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Info

Publication number: CN115295074B
Application number: CN202211220583.XA
Authority: CN
Inventors: 邵阳; 吴雪; 包华; 刘睿; 吴舒雨; 吴旻; 杨珊珊; 刘思思; 郑丽娟
Original assignee: Nanjing Shihe Medical Devices Co ltd; Nanjing Shihe Gene Biotechnology Co ltd
Current assignee: Nanjing Shihe Medical Devices Co ltd; Nanjing Shihe Gene Biotechnology Co ltd
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2022-12-16
Anticipated expiration: 2042-10-08
Also published as: CN115295074A; CN116052768A

Abstract

The invention provides the use of gene markers in malignant pulmonary nodule screening, a method for constructing a screening model, and a detection device. Using high-throughput sequencing results, the invention analyzes features that differ between benign and malignant pulmonary nodules classified as high risk on imaging: the DNA fragment size ratio, the proportion of sequence reads at the breakpoints at the 5' ends of reads, copy number variation in 1 Mb windows, 16 bp tumor new short sequences, and nucleosome coverage patterns. Automated machine learning is used to construct a multi-feature, multi-algorithm ensemble model that predicts whether pulmonary nodules classified as high risk on imaging are benign or malignant, thereby realizing noninvasive, accurate diagnosis of malignant pulmonary nodules and reducing unnecessary resections of benign pulmonary nodules.

Description

Application of gene marker in malignant pulmonary nodule screening, construction method of screening model and detection device

Technical Field

The invention relates to early screening that distinguishes benign from malignant lung nodules that appear high risk on imaging (radiographically high-risk lung nodules), and belongs to the technical field of molecular biomedicine.

Background

Lung cancer is one of the most common cancers in the world. High-risk groups include people aged over 65 who have one or more of the following risk factors: heavy smoking, a past history of smoking, family history, prior thoracic radiation therapy, and exposure to carcinogens. Because early-stage lung cancer lacks overt symptoms, patients are generally diagnosed in the middle or late stages (stage III or IV). However, a number of studies have shown that lung cancer patients diagnosed at an early stage have a higher survival rate: patients diagnosed at stage I have a 13-fold better five-year survival than patients diagnosed at stage IV. Therefore, early detection and diagnosis of lung tumors is crucial to improving the survival of lung cancer patients.

Detection of lung nodules by low-dose chest computed tomography (LDCT) is today the most common diagnostic modality for discovering lung tumors. Surgical resection of lung nodules identified by imaging can effectively reduce lung cancer mortality by 20%-39%. However, approximately 15%-35% of the lung nodules judged to be high risk on initial LDCT imaging are ultimately found to be pathologically benign after surgical resection. Imaging therefore has certain limitations: diagnosing malignant lung tumors from imaging results alone increases the number of unnecessary operations, exposes patients to unnecessary surgical risks and complications, and increases the burden of medical expenses. It is therefore important to determine whether lung nodules judged to be high risk for lung cancer by imaging alone are benign or malignant.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: the prior art lacks a noninvasive means of distinguishing benign from malignant lung nodules, which leads to unnecessary operations and increases the burden on patients.

In the technical scheme of this patent, WGS is performed on cfDNA from plasma samples, and fragment information is obtained from the high-throughput sequencing results. Five features that differ between benign and malignant pulmonary nodules are analyzed: the DNA fragment size ratio (fragment size ratio), the read proportion of 5'-end breakpoint sequences (breakpoint motif), copy number variation in 1 Mb windows (1 Mb-bin copy number variation), 16 bp tumor new short sequences (16 bp neomers), and nucleosome coverage patterns (nucleosome coverage patterns). Using automated machine learning, an elastic network logistic regression model (glm), extreme gradient boosting (xgboost), random forest (random forest), and a neural network (neural network) are used to construct a multi-feature, multi-algorithm ensemble model, realizing noninvasive, accurate diagnosis of malignant pulmonary nodules.

The specific technical scheme is as follows:

the application of the gene marker in preparing a malignant pulmonary nodule screening reagent;

the gene marker comprises:

a first marker: the number of short reads and the number of long reads in different windows of the reference genome, obtained after aligning the cfDNA fragments to the reference genome;

a second marker: the proportion, among all base fragments, of each kind of m-base fragment at the 5' end of the cfDNA fragments aligned to the reference genome;

a third marker: copy number in different windows on chromosomes in WGS data;

a fourth marker: the proportion of tumor new short sequences;

a fifth marker: nucleosome coverage pattern.

The first marker is obtained by the following steps: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads in each window.

The second marker is obtained by the following steps: taking the m bases at the 5' end of each read as a set of base fragments, and obtaining the proportion of each kind of base fragment among all the fragments.

The third marker is obtained by the following steps: the reference genome is divided into a plurality of windows, and copy number data of the WGS data in different windows on chromosomes 1 to 22 are obtained, respectively.

The fourth marker is obtained by the following steps:

generating, by exhaustive enumeration, a set A of all short sequences with a length of 16 bp; exhaustively enumerating a set B of the short sequences with a length of 16 bp that occur in the human reference genome sequence, and defining the sequences of set A that remain after removing those in set B as nullomers (invalid seeds);

obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur multiple times; according to the positions of the base substitutions, finding from the nullomers a set C of nullomers containing the base substitutions;

obtaining the base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding from the nullomers a set D of nullomers containing the base substitutions; removing the nullomers of set D from set C, and defining the remaining nullomers as new short sequences (neomers);

counting the number of samples in which any new short sequence can be read, finding for each new short sequence the number of samples containing it, and taking the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read.

The fifth marker is obtained by the following steps:

obtaining transcription factors from a GTRD database, and excluding the transcription factors which do not have known transcription sites in a CIS-BP database;

taking the range from -5 kb to +5 kb around the transcription sites of the obtained transcription factors as windows, obtaining the fragments with a length of 100-220 bp that can be aligned to the windows, and sequentially performing GC correction and sequencing-depth smoothing on the read data in the windows to obtain a coverage pattern curve for each transcription factor;

for each transcription factor, the following three features were obtained, collectively as the nucleosome coverage pattern:

1) For all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;

2) For the obtained coverage pattern curve, obtaining an amplitude value of a curve trough as the center depth of the transcription factor;

3) And performing fast Fourier transform on the obtained coverage mode curve to obtain the amplitude value of the highest point of the nucleosome amplitude signal.

The method for constructing the malignant lung nodule screening model comprises the following steps:

step 1, extracting cfDNA from samples of a positive group and a control group and sequencing to obtain reading data;

step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads in each window as a first feature set;

step 3, taking the m bases at the 5' end of each read as a set of base fragments, and obtaining the proportion of each kind of base fragment among all the fragments as a second feature set;

step 4, dividing the reference genome into a plurality of windows, and respectively obtaining copy number data in the range of each window as a third feature set;

step 5, taking the ratio of the number of samples reading the 16bp new short sequence to the total number of samples capable of reading any new short sequence as a fourth feature set;

step 6, using the nucleosome coverage pattern characteristics of the selected transcription factor as a fifth characteristic set;

step 7, taking the first, second, third, fourth and fifth feature sets together as the initial feature values, inputting them into the classification model as model feature vectors, taking the benign or malignant status of the lung nodules as the output value, and training the model to obtain the early-screening model.

Step 2 comprises:

step 2-1, dividing the reference genome into a plurality of windows, and obtaining the number of long reads and the number of short reads in each window;

step 2-2, standardizing the numbers of short reads and long reads across all windows from step 2-1, and taking the ratio of the standardized number of short reads to the standardized number of long reads as a first feature value.

In step 2-1, the window size is 5 Mb, and 541 windows are divided.

The short read is 100-150bp in length, and the long read is 151-220bp.

In the step 3, m is 4.

In the step 4, the window size is 1Mb, and 2475 windows are divided.

In step 5, the step of obtaining the fourth feature set is as follows:

step 5-1, generating, by exhaustive enumeration, a set A of all short sequences with a length of 16 bp; exhaustively enumerating a set B of the short sequences with a length of 16 bp that occur in the human reference genome sequence, and defining the sequences of set A that remain after removing those in set B as nullomers (invalid seeds);

step 5-2, obtaining WGS sequencing results of samples of different cancer types from a cancer database, and extracting the base substitution mutations that occur multiple times; according to the positions of the base substitutions, finding from the nullomers a set C of nullomers containing the base substitutions;

step 5-3, obtaining the base substitution mutations with a frequency greater than 0.01 in the East Asian population; according to the positions of the base substitutions, finding from the nullomers a set D of nullomers containing the base substitutions; removing the nullomers of set D from set C, and defining the remaining nullomers as new short sequences (neomers);

step 5-4, counting the number of samples in which any new short sequence can be read, finding for each new short sequence the number of samples containing it, and taking the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read as the fourth feature set of the model.

The cancer database is a PCAWG database.

The different cancer species include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer.

Base substitution mutations in the east asian population were obtained from the gnomAD database.

The step 6 comprises the following steps:

step 6-1, obtaining transcription factors from a GTRD database, and excluding the transcription factors which do not have known transcription sites in a CIS-BP database;

step 6-2, taking the range from -5 kb to +5 kb around the transcription sites of the transcription factors obtained in step 6-1 as windows, obtaining the fragments with a length of 100-220 bp that can be aligned to the windows, and sequentially performing GC correction and sequencing-depth smoothing on the read data in the windows to obtain a coverage pattern curve for each transcription factor;

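step 6-3, for each transcription factor, obtaining the following three features, which together serve as the nucleosome coverage pattern features:

1) for all transcription sites of the transcription factor, calculating the average depth from 1 kb upstream to 1 kb downstream of the transcription sites;

2) for the obtained coverage pattern curve, obtaining the amplitude value of the trough of the curve as the central depth of the transcription factor;

3) performing a fast Fourier transform on the obtained coverage pattern curve to obtain the amplitude value of the highest point of the nucleosome amplitude signal.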

In step 7, the training of the classification model comprises:

step 7-1, inputting the first, second, third, fourth and fifth feature sets into different classifier models respectively, training the models, and obtaining one or more optimal classifier models for each of the first, second, third, fourth and fifth feature sets;

step 7-2, performing second-level ensemble training on the optimal classifier models of the first, second, third, fourth and fifth feature sets obtained in step 7-1 to construct an ensemble classifier model.

The different classifier models are selected from elastic net regression (glm), extreme gradient boosting (xgboost), random forest (random forest), deep learning neural network (deep learning, NN).

In the second-level ensemble training, a generalized linear model (GLM), an extreme gradient boosting (xgboost) model, or a deep learning regression model is used.

A malignant lung nodule detection apparatus comprising:

the sequencing module is used for extracting and sequencing cfDNA of the samples of the positive group and the control group to obtain reading data and WGS sequencing data;

the first characteristic acquisition module is used for aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and obtaining the ratio of the number of short reads to the number of long reads in each window as a first characteristic set;

the second characteristic acquisition module is used for taking the m bases at the 5' end of each read as a set of base fragments, and obtaining the proportion of each kind of base fragment among all the fragments as a second characteristic set;

the third characteristic acquisition module is used for dividing the reference genome into a plurality of windows and respectively acquiring copy number data of the WGS data in different windows on the chromosome as a third characteristic set;

the fourth characteristic acquisition module is used for taking the proportion of the number of samples which read the 16bp new short sequence to the total number of all samples which can read any new short sequence as a fourth characteristic set;

the fifth characteristic acquisition module analyzes the nucleosome coverage pattern characteristic of the selected transcription factor to serve as a fifth characteristic set;

and the prediction module is used for taking the first, second, third, fourth and fifth feature sets as initial feature values together, taking the initial feature values as model feature vectors to be input into the classification model, taking the benign and malignant lung nodules as output values, training the model and obtaining the early-screening model.

Drawings

FIG. 1 is a schematic diagram of a model building process;

FIG. 2 shows the distribution of differences for the highest-contributing feature of each feature type;

FIG. 3 shows the AUC curves on the training set for models using each feature alone and for the model using all features;

FIG. 4 is an AUC curve for a model using all features on the validation set;

FIG. 5 shows the predicted scores of the classifier after ensembling all models, on the training set;

FIG. 6 shows the predicted scores of the classifier after ensembling all models, on the validation set.

Detailed Description

The calculation method of the invention is detailed as follows:

The invention first requires extracting cfDNA from blood samples, constructing a library, and sequencing. The extraction and library construction methods are not particularly limited and can be adapted from extraction methods in the prior art. The base information of the cfDNA can be obtained in the sequencing step using sequencing techniques in the related art.

The purpose of the model in this patent is to distinguish between benign lung nodules and malignant lung nodules. The samples are grouped as follows: among the lung nodule patients judged to be high risk by LDCT imaging, those determined to have benign lung nodules by subsequent postoperative pathology serve as the control group, and those determined to have malignant lung nodules serve as the positive group.

The data set conditions adopted in the model construction process of the invention are as follows:

extraction and sequencing method of plasma cfDNA sample:

A liquid biopsy is performed on the patient before LDCT imaging diagnosis. A 10 ml whole blood sample is collected from the patient in a purple-top blood collection tube (EDTA anticoagulant tube), the plasma is separated by centrifugation promptly (within 2 hours), and the sample is transferred to the laboratory for analysis under frozen storage at -80 ℃. After transfer to the laboratory, cfDNA is extracted from the plasma samples using the QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. A library is constructed from the collected cfDNA samples, and WGS is performed at 5x depth. After the raw sequencing data are obtained, they are aligned to the human reference genome to obtain the base information of the corresponding reads.

The model establishing process of the patent mainly comprises the following steps:

step 2, extracting cfDNA from the samples of the positive group and the control group and sequencing it to obtain read data;

step 3, aligning the read data to a reference genome, obtaining the number of reads in different length intervals within different windows of the reference genome, and taking the ratio of the numbers of reads of different lengths as a first feature value;

step 4, aligning the read data to a reference genome to obtain the position of the 5' end of each read on the reference genome; obtaining the sequences of the m bp upstream and downstream of that position as a set of base fragments; and taking the proportion of each base fragment among all the fragments as a second feature value;

step 5, dividing the reference genome into a plurality of windows, and respectively obtaining copy number data in the range of each window as a third characteristic value;

step 6, taking the ratio of the reads of the 16 bp new short sequences to the total reads as a fourth feature;

step 7, analyzing the nucleosome coverage mode of the selected transcription factor as a fifth characteristic;

step 8, inputting the model feature vectors of the samples of the positive group and the control group into the first-layer models, and selecting the 5 models with the best AUC for each feature type, for a total of 25 feature models from the first layer;

step 9, inputting the 25 models selected in step 8 into the second-layer ensemble models, outputting the three ensemble models with the highest AUC, and taking the average of the predicted probabilities output by these three ensemble models as the final result.
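The two-layer selection in the last two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names (`auc`, `top_models`, `final_score`), the model names, and the toy scores are hypothetical, and AUC is computed here by simple pairwise comparison.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_models(candidates, labels, k):
    """Keep the names of the k candidate models with the highest AUC.

    `candidates` maps a (hypothetical) model name to its predicted scores.
    """
    ranked = sorted(candidates,
                    key=lambda name: auc(labels, candidates[name]),
                    reverse=True)
    return ranked[:k]


def final_score(ensemble_predictions):
    """Average the predicted probabilities of the selected ensemble models."""
    n = len(ensemble_predictions)
    return [sum(col) / n for col in zip(*ensemble_predictions)]


# Toy data: 1 = malignant (positive group), 0 = benign (control group).
labels = [1, 1, 0, 0]
candidates = {
    "glm_a": [0.9, 0.8, 0.2, 0.1],
    "rf_a": [0.6, 0.4, 0.5, 0.3],
    "nn_a": [0.1, 0.2, 0.8, 0.9],
}
best_two = top_models(candidates, labels, k=2)

# Per-sample average over three ensemble models (two samples shown).
averaged = final_score([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
```

In the scheme described above, `top_models` would be applied with k = 5 to each of the five feature types (25 first-layer models) and with k = 3 to the second-layer ensemble models before averaging.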

The five feature types in this patent are detailed as follows:

1. DNA Fragment Size Ratio (FSR)

The DNA fragment size ratio reflects how the length distribution of cfDNA reads differs between benign and malignant tumors. The ratio of short DNA fragments to long DNA fragments is used in machine learning to build a prediction model that distinguishes benign lung nodules from malignant lung nodules.

The cfDNA read length data are obtained as follows: the quality, length, and alignment position of each read are recorded in the aligned BAM files, and the human reference genome used is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). The human reference genome is cut into 541 windows of 5 Mb each, and the number of short reads (100 bp-150 bp) and the number of long reads (151 bp-220 bp) are counted in each window. Each count is then standardized against the counts across all windows, that is, standardized value = (original value - mean) / standard deviation. This yields 541 sets of standardized read counts for the different read lengths.
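The window counting and standardization described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names (`window_counts`, `zscore`), the single hypothetical contig, the toy read coordinates, and the use of the population standard deviation are illustrative choices, and short and long reads are taken as 100-150 bp and 151-220 bp as given in the summary of this patent. A real pipeline would read positions and lengths from the aligned BAM.

```python
from statistics import mean, pstdev

WINDOW = 5_000_000  # 5 Mb window size, as stated in the text


def window_counts(reads, n_windows, window=WINDOW):
    """Count short (100-150 bp) and long (151-220 bp) reads per window.

    `reads` is an iterable of (start_position, fragment_length) tuples on a
    single hypothetical contig.
    """
    short = [0] * n_windows
    long_ = [0] * n_windows
    for start, length in reads:
        idx = start // window
        if idx >= n_windows:
            continue  # read falls outside the windows being tracked
        if 100 <= length <= 150:
            short[idx] += 1
        elif 151 <= length <= 220:
            long_[idx] += 1
    return short, long_


def zscore(values):
    """Standardized value = (original value - mean) / standard deviation."""
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


# Toy reads: (start, length); the 90 bp read is neither short nor long.
reads = [(100, 120), (200, 180), (5_000_100, 140), (5_000_200, 145),
         (10_000_000, 200), (10_000_500, 90)]
short, long_ = window_counts(reads, n_windows=3)
short_z, long_z = zscore(short), zscore(long_)
```

The sketch stops at the standardized counts; how the ratio is handled when a standardized long-read count is zero is not specified in the text.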

2. Proportion of sequence reads at the breakpoints at the 5' ends of reads (Breakpoint Motifs, BPM)

Human genomic DNA has a double-helix structure held together by hydrogen bonds between complementary base pairs. During normal aging and cancer progression, the pH of the environment surrounding the cells changes, so that the hydrogen bonds between complementary bases are destroyed and breaks occur. The base sequences at the breaks differ, and so do the proportions of reads carrying each breakpoint sequence. The collection method is as follows: the aligned BAM records the base information and alignment position of each read; for each read, the 4 bp upstream and the 4 bp downstream of the human reference genome coordinate of its 5' end are identified; the number of reads is counted for each 8 bp breakpoint sequence (4^8 = 65536 sequences in total); and the read proportions of the 65536 breakpoint sequences are calculated, for example, AAAAAAAA read proportion = number of AAAAAAAA reads / total number of breakpoint sequence reads.
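The breakpoint motif counting described above can be sketched as follows. This is an illustration under stated assumptions: `breakpoint_motif_proportions` is a hypothetical helper, the reference is a plain string standing in for a contig, and the 5' end coordinates are toy values rather than positions taken from an aligned BAM.

```python
from collections import Counter
from itertools import product


def breakpoint_motif_proportions(reference, five_prime_positions, m=4):
    """Return the proportion of each 2*m bp motif centred on a 5' breakpoint.

    `reference` is a string standing in for a reference contig and
    `five_prime_positions` are 0-based coordinates of read 5' ends.
    """
    counts = Counter()
    for pos in five_prime_positions:
        if pos - m < 0 or pos + m > len(reference):
            continue  # not enough flanking sequence on one side
        motif = reference[pos - m:pos + m].upper()
        if set(motif) <= set("ACGT"):  # skip motifs containing N
            counts[motif] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {motif: n / total for motif, n in counts.items()}


# All 4**8 = 65536 possible 8 bp motifs define the full feature vector.
ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=8)]

ref = "ACGTACGT" + "A" * 9 + "CGT"
props = breakpoint_motif_proportions(ref, [4, 12, 12, 8])
# Motifs never observed get proportion 0 in the full vector.
feature_vector = [props.get(motif, 0.0) for motif in ALL_MOTIFS]
```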

3. 1 Mb Window Copy Number Variation (1 Mb-Bin Copy Number Variation, CNV)

Copy number changes are highly correlated with individual cancers. Although cancers can already be distinguished by detecting copy number changes in some cancer-associated genes or in particular genomic intervals, other rare or unknown genes or intervals can also provide information about potential copy number changes. The collection method is as follows: first, WGS data are collected from 30 healthy people, and chromosomes 1-22 of the reference genome are divided into non-overlapping windows of 1 Mb. For each sample, the read depth in each window is calculated using bedtools coverage and corrected according to the GC content and the average mappability record (UCSC BigWig file) of each window. The median depth of the 30 healthy people in each window is taken as representative, giving a population reference baseline of read depths for 2475 windows. For each sample to be tested, the read depths of the 2475 windows are obtained in the same way, and the log copy number change of each window, namely log2 (corrected and normalized depth of the sample to be tested / corrected and normalized depth of the population baseline), is constructed using a hidden Markov model (HMM) and the population baseline depth of each window, giving the copy number change information of each sample to be tested.
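The baseline and log2 ratio computation described above can be sketched as follows. This is a simplified sketch under stated assumptions: it takes already-corrected per-window depths as input, omits the GC/mappability correction and the HMM step entirely, and the function names and the pseudocount are illustrative additions not found in the text.

```python
from math import log2
from statistics import median


def population_baseline(healthy_depths):
    """Per-window median depth across healthy samples.

    `healthy_depths` is a list of per-sample lists of corrected depths,
    one value per 1 Mb window.
    """
    return [median(window) for window in zip(*healthy_depths)]


def log2_ratios(sample_depths, baseline, pseudocount=1e-9):
    """log2(sample depth / baseline depth) for each window.

    The pseudocount is an assumption added here to avoid log2(0).
    """
    return [log2((s + pseudocount) / (b + pseudocount))
            for s, b in zip(sample_depths, baseline)]


# Three toy healthy samples over three windows (the text uses 30 and 2475).
healthy = [[10.0, 20.0, 30.0], [12.0, 18.0, 30.0], [11.0, 22.0, 33.0]]
baseline = population_baseline(healthy)

sample = [22.0, 20.0, 15.0]  # gain, neutral, loss relative to the baseline
ratios = log2_ratios(sample, baseline)
```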

4. 16 bp tumor new short sequences (16 bp Neomers, NEO)

A nullomer is a short DNA sequence that is not present in the human genome. The 16 bp tumor new short sequences are a subset of nullomers: specifically, nullomers of 16 bp in length that are absent from the human genome but are repeatedly found in the genomes of tumor tissue.

The characteristic value is obtained in the following way in the patent:

First, all possible short sequences of 16 bp in length are generated by exhaustive enumeration, forming set A. Then, in the human reference genome sequence (hg19), a window sliding in steps of 1 bp is used to exhaustively find all short sequences of 16 bp in length and their numbers of occurrences, forming set B. The 16 bp short sequences in set A that do not appear in set B are defined as nullomers.

By analyzing the PCAWG database (https://dcc.icgc.org/PCAWG), the WGS mutation results of 2577 patients with 6 different types of cancer (intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer, and liver cancer) are obtained, and 977 recurrent base substitutions (each occurring at least twice) are extracted from them. According to the positions of the base substitutions, the set C of all possible nullomers containing a base substitution is extracted by exhaustive enumeration.

From the gnomAD database (https://gnomad.broadinstitute.org/), the base substitution sites with a frequency greater than 0.01 in the East Asian population are collected; according to the positions of these sites, the set D of short sequences containing them is found among the nullomers and serves as the set of nullomers common in the East Asian population. The nullomers of set D are removed from set C, yielding 4616 new 16 bp tumor-related short sequences. For these 4616 new short sequences, the number of samples in which any one of the 4616 new short sequences can be read is counted first; then, for each new short sequence, the number of samples containing it is found; and the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read (4616 ratio values in total) is taken as the fourth feature of the model.
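The set operations behind this feature (A, B, C, D and the per-sequence sample ratio) can be sketched as follows. This is a toy illustration under stated assumptions: it uses k = 3 instead of 16 so that set A stays small, a short string as the reference, and hypothetical substitutions given as (0-based position, alternate base); all function names are illustrative.

```python
from itertools import product


def all_kmers(k):
    """Set A: every possible k-mer over ACGT."""
    return {"".join(p) for p in product("ACGT", repeat=k)}


def kmers_in(seq, k):
    """Set B: k-mers seen when sliding a window along seq in 1 bp steps."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmers_covering_substitution(seq, pos, alt, k):
    """All k-mers overlapping position `pos` after substituting base `alt`."""
    mutated = seq[:pos] + alt + seq[pos + 1:]
    lo = max(0, pos - k + 1)
    hi = min(pos, len(mutated) - k)
    return {mutated[i:i + k] for i in range(lo, hi + 1)}


def neomers(reference, tumor_subs, population_subs, k):
    """Nullomers created by tumor substitutions but not by common variants."""
    nullomers = all_kmers(k) - kmers_in(reference, k)  # A minus B
    set_c, set_d = set(), set()
    for pos, alt in tumor_subs:
        set_c |= kmers_covering_substitution(reference, pos, alt, k) & nullomers
    for pos, alt in population_subs:
        set_d |= kmers_covering_substitution(reference, pos, alt, k) & nullomers
    return set_c - set_d


def sample_ratios(neomer_set, sample_kmers):
    """Per neomer: samples containing it / samples containing any neomer."""
    hits = [kmers & neomer_set for kmers in sample_kmers]
    n_any = sum(1 for h in hits if h)
    if n_any == 0:
        return {n: 0.0 for n in neomer_set}
    return {n: sum(1 for h in hits if n in h) / n_any for n in neomer_set}


reference = "ACGTACGT"
# The substitution at position 4 is also a common population variant,
# so the nullomers it creates are excluded.
found = neomers(reference, [(3, "A"), (4, "G")], [(4, "G")], k=3)
ratios = sample_ratios(found, [{"CGA", "ACG"}, {"CGA", "AAC"}, {"ACG"}])
```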

5. Nucleosome coverage Pattern (Nucleosome coverage patterns, NCP)

The transcription factors are selected from the GTRD database (https://gtrd.biouml.org/#!) (v21.12); transcription factors without known transcription sites in the CIS-BP database (http://cisbp.ccbr.utoronto.ca/) (v2.00) are excluded; and the 334 transcription factors with more than 10000 high-matching sites are selected.

For the transcription factors obtained above, the range from -5 kb to +5 kb around the transcription sites of these target transcription factors is used as the windows, and the fragments of 100-220 bp in length that can be aligned to these windows are obtained. GC correction is performed on these fragments, and the sequencing depth is smoothed with a Savitzky-Golay filter with a polynomial order of 3 to obtain the final coverage pattern curve of each transcription factor.

After obtaining the coverage pattern curve described above, three features were extracted for each transcription factor:

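1) For all transcription sites of the transcription factor, the average depth from 1 kb upstream to 1 kb downstream of these transcription sites;

2) The central depth of the transcription factor, taken as the amplitude value of the trough of the coverage pattern curve;

3) The amplitude value of the highest point of the nucleosome amplitude signal, obtained by performing a fast Fourier transform on the coverage pattern curve.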

These three features are collectively taken as the eigenvalues of the nucleosome coverage pattern.
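The extraction of the three nucleosome coverage pattern features can be sketched as follows. This is a rough sketch under several assumptions that the text does not state: a 5-point Savitzky-Golay window (only the polynomial order of 3 is given), unsmoothed edge points, the trough taken as the minimum of the smoothed curve within 1 kb of the site, and the peak taken over all non-zero frequencies of a plain discrete Fourier transform. All function names and the synthetic curve are hypothetical.

```python
import cmath
import math

# Standard 5-point Savitzky-Golay smoothing coefficients (quadratic/cubic fit).
SG5_CUBIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(y):
    """Savitzky-Golay smoothing, 5-point window, polynomial order 3.

    The window length of 5 is an assumption; edge points are left unsmoothed.
    """
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(c * y[i + j - 2] for j, c in enumerate(SG5_CUBIC))
    return out


def dft_amplitudes(y):
    """Amplitude spectrum for frequencies 1..n//2 (the DC term is excluded)."""
    n = len(y)
    amps = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(y))
        amps.append(abs(coeff) / n)
    return amps


def ncp_features(curve, positions):
    """Return (mean depth within +/-1 kb, central depth, peak amplitude)."""
    smooth = savgol5(curve)
    near = [d for p, d in zip(positions, smooth) if -1000 <= p <= 1000]
    mean_depth = sum(near) / len(near)
    central_depth = min(near)  # trough of the curve, assumed near the site
    peak_amp = max(dft_amplitudes(smooth))
    return mean_depth, central_depth, peak_amp


# Synthetic coverage curve from -5 kb to +5 kb with a dip at the site.
positions = list(range(-5000, 5001, 50))
curve = [1.0 - 0.5 * math.exp(-(p / 300) ** 2) for p in positions]
mean_depth, central, peak_amp = ncp_features(curve, positions)
```

A library routine such as a fast Fourier transform would replace `dft_amplitudes` in practice; the direct transform is used here only to keep the sketch self-contained.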

Through the above data acquisition, the initial data vectors of the five types of features are obtained. A corresponding calculation method is then designed. In this patent, a conventional classifier algorithm can be used for classification: the feature values are input into the classifier, and the probability of a malignant lung nodule is the output. The classifier models preferably adopted in this patent comprise the following four types; when a classifier is optimized, sub-models with different model parameters under the same model type are generated at the same time so that the sub-models can be screened. The four main models are:

1. Elastic network regression model (glm)

The elastic network regression model is a common machine learning algorithm. It fits a generalized linear model by penalized maximum likelihood, combining the L2 regularization of ridge regression with the L1 regularization of LASSO regression. The regularization path is computed for the lasso or elastic network penalty over a grid of values of the regularization parameter λ, which addresses the overfitting problem in regression. The hyperparameter alpha controls the balance between the L1 and L2 regularization.
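How λ and alpha interact can be seen in a small sketch of the penalty term. The form used below, λ·(alpha·L1 + (1 - alpha)/2·L2²), is the common glmnet-style parameterization and is an assumption here, since the patent does not write the penalty out; `elastic_net_penalty` is a hypothetical helper.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared).

    alpha = 1 gives the LASSO (L1) penalty; alpha = 0 gives the ridge (L2)
    penalty. This parameterization is assumed, not taken from the patent.
    """
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)


beta = [1.0, -2.0, 0.0]  # toy coefficient vector
lasso_only = elastic_net_penalty(beta, lam=0.1, alpha=1.0)
ridge_only = elastic_net_penalty(beta, lam=0.1, alpha=0.0)
mixed = elastic_net_penalty(beta, lam=0.1, alpha=0.5)
```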

2. Extreme gradient boost (xgboost)

XGBoost is an optimized ensemble algorithm for additive models, based on the gradient boosting decision tree (GBDT). It expands the loss function with a second-order Taylor formula to optimize it and improve computational accuracy, simplifies the model with a regularization term to avoid overfitting, and can compute in parallel using its block storage structure.

The XGBoost base learner used in this patent is a tree model. As the depth of a tree increases, its complexity increases; the model can be trained better, but overfitting can also result. The hyperparameter max_depth controls the maximum depth of the tree, and the hyperparameter min_rows controls the minimum number of samples in each leaf node.

3. Random forest (random forest)

Random forests are a powerful classification and regression tool for high-dimensional and multicollinear data. Given a data set, a random forest randomly samples part of the information to generate a forest of decision trees that assist classification or regression; node-splitting attributes are chosen, and the random sampling is repeated until no further split is possible. Finally, the results of all the trees are combined to obtain the final prediction. The random forest also controls the complexity of its trees through hyperparameters such as max_depth and min_rows.

4. Deep learning neural network model (deep learning, NN)

The neural network consists of inputs, weights, biases or thresholds and outputs, and the output of any single node is above a specified threshold, then that node is activated and the data is sent to the next layer of the network. Each node of the input layer and each node of the hidden layer are subjected to point-to-point calculation by using a weighted summation and activation method. Each value calculated using the hidden layer is calculated using the same method and output layer. The method has the advantages of high classification accuracy, strong parallel distribution processing capability and strong distribution storage and learning capabilities.

The deep learning neural network model used in this patent is a multi-layer feedforward neural network. Its structure consists of an input layer at the front, hidden layers in the middle and an output layer at the end; the middle may contain multiple hidden layers. Network signals are transmitted in one direction from the input layer to the output layer, and the neurons of each layer are connected only to the neurons of the previous layer, from which they receive information. The network is trained with a stochastic gradient descent algorithm, which estimates the error gradient of the current state of the model from data in the training data set and then updates the weights of the model using that gradient. The learning rate (epsilon), which sets the amount by which the weights are updated during training and thus controls how quickly the model adapts to the problem, and the learning rate decay index (rho) are both important configurable parameters in neural network training.
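By way of illustration only, the one-way signal flow and the weight update described above can be sketched as follows. This is a minimal sketch under stated assumptions: the sigmoid activation, the layer sizes, the weights and the helper names (dense_layer, forward, sgd_step) are hypothetical and are not the configuration used in the patent.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice of activation function)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs, weights, biases):
    """Weighted sum plus bias for each node, followed by the activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + b)
        for node_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the signal one way: input layer -> hidden layers -> output layer."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations


def sgd_step(weights, gradients, learning_rate):
    """One gradient-descent update: move each weight against its gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                     # output layer
]
output = forward([1.0, 1.0], layers)
updated = sgd_step([1.0, -2.0], [0.5, -0.5], learning_rate=0.1)
```

A larger learning rate makes each call to sgd_step move the weights further, which is the sense in which it controls the speed of adaptation.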

In addition, the patent adopts a random grid search (Random Grid Search) algorithm over the hyperparameters to optimize the models.

Random search is a common method of hyperparameter optimization in machine learning. It randomly samples parameter values within a specified range for the model and selects the optimal parameter combination from the sampled values. Rather than trying all possible combinations, the method evaluates a fixed number of random combinations, drawing a random value for each hyperparameter. Compared with manual tuning and grid search, random search can achieve a good effect with fewer search iterations and provides a more efficient solution, especially when the number of parameters is large.
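By way of illustration only, the random search described above can be sketched as follows. This is a minimal sketch: the search space, the value lists and the toy scoring function are hypothetical placeholders (in practice the score would be a model quality metric such as AUC), and the function name random_search is not from the patent.

```python
import random


def random_search(param_space, score_fn, n_iter=20, seed=0):
    """Evaluate n_iter random hyperparameter combinations; return the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one random value for each hyperparameter.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space (names follow the hyperparameters in the text).
space = {
    "max_depth": [3, 5, 7, 9, 11],
    "min_rows": [1, 5, 10, 20],
    "learn_rate": [0.01, 0.05, 0.1, 0.3],
}


def toy_score(params):
    """Placeholder for a real evaluation; higher is better, maximum is 0."""
    return (
        -abs(params["max_depth"] - 7)
        - abs(params["min_rows"] - 5) / 5
        - abs(params["learn_rate"] - 0.1)
    )


best_params, best_score = random_search(space, toy_score, n_iter=30, seed=42)
```

Only 30 of the 80 possible combinations are evaluated here, which is the trade-off relative to an exhaustive grid search.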

In the implementation process of the four algorithms in the patent, the following … (algorithm type, or which algorithm toolkit is called) is specifically adopted

The hyper-parameters of the four algorithms used in this patent are shown in the following table:

After the five types of initial data were acquired for 247 patients with malignant lung nodules and 60 patients with benign lung nodules, the Fragment Size Ratio (FSR) statistics were used as input values (the input vector of each sample comprises 541 feature values, each a read fragment size ratio). The malignant and benign lung nodule samples were classified by each of the four classification models. During screening, the parameters and structure of the four types of models were varied by random hyperparameter search, each resulting configuration was trained and evaluated as a sub-model, and the five optimal sub-models for this feature were selected, with the AUC of the model on the training set used as the index of classification performance. Similarly, after the proportions of breakpoint sequence reads at the 5' end of the DNA fragments were collected for the benign and malignant pulmonary nodule patients, the 5'-end breakpoint sequence proportions (65536 kinds) were used as input values, the samples were classified by the four classification models, and the five optimal sub-models for this feature were selected (the model optimization and hyperparameter adjustment process is the same as above). Likewise, copy number variation (2475), new short sequences (4616) and nucleosome coverage patterns (1002) were each used as input values and classified by the four types of models, and the optimal five sub-models were selected for each feature (the model optimization process is the same as above). Through the above calculation process, a total of 5 × 5 = 25 model calculation results are obtained. In each calculation, the contribution value of each feature variable to the classification result can be obtained.
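By way of illustration only, ranking candidate sub-models by AUC and keeping the best ones can be sketched as follows. This is a minimal sketch: the pairwise AUC formula is a standard definition rather than the patent's stated implementation, and the sample labels, scores and sub-model names are hypothetical placeholders.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_k_submodels(labels, submodel_scores, k=5):
    """Return the names of the k sub-models with the highest AUC."""
    ranked = sorted(
        submodel_scores,
        key=lambda name: auc(labels, submodel_scores[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data: 1 = malignant, 0 = benign; scores are predicted risks.
labels = [1, 1, 1, 0, 0]
candidates = {
    "nn_a": [0.9, 0.8, 0.7, 0.2, 0.1],
    "glm_a": [0.9, 0.4, 0.7, 0.5, 0.1],
    "rf_a": [0.1, 0.2, 0.3, 0.8, 0.9],
}
best_two = top_k_submodels(labels, candidates, k=2)
```

Applying top_k_submodels with k=5 to each of the five feature types would yield the 25 base models referred to in the text.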

The 5 optimal models (25 models in total) selected by each feature are respectively as shown in the following table:

The contribution values and the feature variables of the optimal model selected by each feature are ranked as follows:

1. DNA fragment size ratio (FSR), deep learning neural network model (Deep Learning, NN):

2. 5'-end breakpoint sequence (BPM) read, elastic network regression model (GLM):

3. Copy number variation (CNV), deep learning neural network model (Deep Learning, NN):

4. 16bp tumor new short sequence (NEO), XGBoost model:

5. Nucleosome coverage pattern (NCP), XGBoost model:

To further improve the prediction performance of the classifier, secondary ensemble training (stacking) is performed on the results of the 25 trained models. Stacking is an ensemble learning technique in which the outputs of the 25 low-level classifiers (1st-level base models) are used for a further round of learning (2nd-level meta-learning), which gathers the characteristics of each base classifier and finds an optimal way to integrate them, thereby improving the prediction performance of the model. The training algorithms used for stacking are a Generalized Linear Model (GLM), an extreme gradient boosting (XGBoost) model and a deep learning regression model.

The 3 optimal integration models and feature model importance (variables import) are shown in the following table:

The optimal stacking model, the one with the highest AUC, is the Generalized Linear Model (GLM), which establishes a relation between the mathematical expectation of the response variable and a linear combination of the predictor variables through a link function. The 25 training models are converted into a final linear equation:

ALL Stacked = Intercept + A × CNV model 1 + B × CNV model 2 + C × CNV model 3 + D × CNV model 4 + E × CNV model 5 + F × BPM model 1 + G × BPM model 2 + H × BPM model 3 + I × BPM model 4 + J × BPM model 5 + K × FSR model 1 + L × FSR model 2 + M × FSR model 3 + N × FSR model 4 + O × FSR model 5 + P × NEO model 1 + Q × NEO model 2 + R × NEO model 3 + S × NEO model 4 + T × NEO model 5 + U × NCP model 1 + V × NCP model 2 + W × NCP model 3 + X × NCP model 4 + Y × NCP model 5.
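By way of illustration only, evaluating the linear equation above for one sample can be sketched as follows. This is a minimal sketch under stated assumptions: the logistic (logit) link is assumed because the outcome is binary, and the coefficient values, intercept and base-model outputs are hypothetical placeholders, not the coefficients of the patent.

```python
import math

FEATURES = ["CNV", "BPM", "FSR", "NEO", "NCP"]
# 5 feature types x 5 sub-models each = 25 base models.
MODEL_NAMES = [f"{feat} model {i}" for feat in FEATURES for i in range(1, 6)]


def stacked_glm_score(base_predictions, coefficients, intercept):
    """Intercept plus weighted base-model outputs, mapped to a probability."""
    linear = intercept + sum(
        coefficients[name] * base_predictions[name] for name in MODEL_NAMES
    )
    # Inverse of the assumed logit link.
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical placeholder values.
coefficients = {name: 0.2 for name in MODEL_NAMES}
base_predictions = {name: 0.5 for name in MODEL_NAMES}
score = stacked_glm_score(base_predictions, coefficients, intercept=-2.5)
```

With these placeholder values the linear term is -2.5 + 25 × 0.2 × 0.5 = 0, so the stacked score is approximately 0.5.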

The specific coefficients are as follows:

Each feature has a certain predictive effect under the different training algorithms, and secondary ensemble training over the single-feature models improves the prediction effect. Finally, the AUC of the prediction result reaches 0.9474 on the training set and 0.931 on the verification set, with a sensitivity of 85% and a specificity of 98.7%.
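By way of illustration only, sensitivity and specificity can be computed from binary calls as sketched below. The labels and calls are hypothetical placeholders and do not reproduce the patent's data.

```python
def sensitivity_specificity(labels, calls):
    """Return (sensitivity, specificity) for binary labels and calls.

    1 = malignant, 0 = benign.
    """
    tp = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 1)
    fn = sum(1 for y, c in zip(labels, calls) if y == 1 and c == 0)
    tn = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 0)
    fp = sum(1 for y, c in zip(labels, calls) if y == 0 and c == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: 4 malignant and 4 benign samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
calls = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(labels, calls)
```

Sensitivity is the fraction of malignant nodules called malignant; specificity is the fraction of benign nodules called benign.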

Claims

1. The application of the gene marker in preparing a malignant pulmonary nodule screening reagent is characterized in that the gene marker comprises:

a third marker: copy number in different windows on chromosomes in WGS data;

a fourth marker: the new short sequence proportion of the tumor;

fifth marker: nucleosome coverage pattern;

the fourth marker is obtained by the following steps:

generating a set A of short sequences with a length of 16bp by exhaustive enumeration; enumerating the set B of all short sequences with a length of 16bp in the human reference gene sequence, removing the data of set B from set A, and defining the remainder as invalid subsequences;

obtaining WGS sequencing results of samples of different cancer species from a cancer database, and extracting base substitution mutations which appear multiple times; according to the positions of the base substitutions, finding among the invalid subsequences a set C of invalid subsequences containing the base substitutions;

obtaining base substitution mutations with a frequency of more than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid subsequences a set D of invalid subsequences containing the base substitutions; removing the invalid subsequences of set D from set C, and defining the remaining invalid subsequences as new short sequences;

counting the number of samples in which any new short sequence can be read; for each new short sequence, counting the number of samples containing that new short sequence; and taking the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read;

the fifth marker is obtained by the following steps:

obtaining transcription factors from a GTRD database, and excluding the transcription factors with known transcription sites which are not in a CIS-BP database;

2. The use of claim 1, wherein said first marker is obtained by: aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and respectively obtaining the ratio of the number of short reads to the number of ultra-long reads within the range of each window;

the second marker is obtained by the following steps: taking the m bases at the 5' end of each read as a base fragment set, and obtaining the proportion of each kind of base fragment among all fragments;

the third marker is obtained by the following steps: the reference genome is divided into a plurality of windows, and copy number data in different windows on chromosomes 1 to 22 in the WGS data are obtained separately.

3. The method for constructing the malignant lung nodule screening model is characterized by comprising the following steps of:

step 1, extracting and sequencing cfDNA of samples of a positive group and a control group to obtain reading data;

step 2, aligning the read data to a reference genome, dividing the reference genome into a plurality of windows, and respectively obtaining the ratio of the number of short reads to the number of ultra-long reads in each window range as a first feature set;

step 4, dividing the reference genome into a plurality of windows, and respectively obtaining copy number data of WGS data in different windows on a chromosome as a third feature set;

step 6, analyzing nucleosome coverage pattern characteristics of the selected transcription factor to serve as a fifth characteristic set;

step 7, taking the first, second, third, fourth and fifth feature sets as initial feature values together, taking the initial feature values as model feature vectors to be input into a classification model, and taking benign and malignant lung nodules as output values to train the model to obtain an early screening model;

the fourth feature set is obtained as follows:

step 5-1, generating a set A of short sequences with a length of 16bp by exhaustive enumeration; enumerating the set B of all short sequences with a length of 16bp in the human reference gene sequence, removing the data of set B from set A, and defining the remainder as invalid subsequences;

step 5-2, obtaining WGS sequencing results of samples of different cancer species from a cancer database, and extracting base substitution mutations which appear multiple times; according to the positions of the base substitutions, finding among the invalid subsequences a set C of invalid subsequences containing the base substitutions;

step 5-3, obtaining base substitution mutations with a frequency of more than 0.01 in the East Asian population; according to the positions of the base substitutions, finding among the invalid subsequences a set D of invalid subsequences containing the base substitutions; removing the invalid subsequences of set D from set C, and defining the remaining invalid subsequences as new short sequences;

step 5-4, counting the number of samples in which any new short sequence can be read; for each new short sequence, counting the number of samples containing that new short sequence; and taking the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read as a fourth feature set of the model;

the cancer database is a PCAWG database;

different cancer species refer to intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer;

the base substitution mutations in the East Asian population are obtained from the gnomAD database;

the step 6 comprises the following steps:

4. The method for constructing a malignant lung nodule screening model according to claim 3, wherein the step 3 comprises: step 3-1, dividing the reference genome into a plurality of windows, and respectively obtaining the number of long reads and the number of short reads within the range of each window; and 3-2, standardizing the short reading number and the long reading number of all windows in the step 3-1, and taking the ratio of the standardized short reading number and long reading number as a first characteristic value.

5. The method as claimed in claim 4, wherein the window size in step 3-1 is 5Mb, and 541 windows are defined.

6. The method of claim 3, wherein the short reads are reads with a length of 100 to 150bp and the long reads are reads with a length of 151 to 220bp;

in the step 3, m is 4;

in the step 4, the window size is 1Mb, and 2475 windows are divided.

7. The method of claim 3, wherein in step 7, the step of classifying the model comprises:

step 7-2, performing secondary ensemble training on the optimal classifier models of the first, second, third, fourth and fifth feature sets obtained in the step 7-1 to construct an integrated classifier model;

the different classifier models are selected from an elastic network regression model, an extreme gradient boosting model, a random forest model and a deep learning neural network model; a generalized linear model, an extreme gradient boosting (XGBoost) model or a deep learning regression model is adopted in the secondary ensemble training.

8. A malignant lung nodule detecting apparatus, comprising:

the prediction module is used for taking the first, second, third, fourth and fifth feature sets as initial feature values together, taking the initial feature values as model feature vectors to be input into the classification model, taking the benign and malignant lung nodules as output values, and training the model to obtain an early-screening model;

the fourth feature set is obtained as follows:

step 5-4, counting the number of samples in which any new short sequence can be read; for each new short sequence, counting the number of samples containing that new short sequence; and taking the ratio of the number of samples for each new short sequence to the total number of samples in which any new short sequence can be read as a fourth feature set of the model;

the cancer database is a PCAWG database;

the different cancer species include intestinal cancer, lung cancer, breast cancer, gastric cancer, prostate cancer and liver cancer;

the step 6 comprises the following steps:

step 6-2, taking the range of -5kb to +5kb around the transcription site of each transcription factor obtained in the step 6-1 as a window, obtaining fragments which can be aligned to the window and have a length of 100-220bp, and sequentially carrying out GC correction and sequencing depth smoothing on the read data in the window to obtain a coverage pattern curve of each transcription factor;