CN118380156A - Model construction method and related device for lung nodule malignancy risk assessment

Info

Publication number: CN118380156A
Application number: CN202410814677.2A
Authority: CN
Inventors: 王俊奇; 何建行; 陈思彤; 梁恒瑞; 彭敏桦; 张晗; 樊鹏南
Original assignee: Jingzhi Future Guangzhou Intelligent Technology Co ltd
Current assignee: Jingzhi Future Guangzhou Intelligent Technology Co ltd
Priority date: 2024-06-24
Filing date: 2024-06-24
Publication date: 2024-07-23
Anticipated expiration: 2044-06-24
Also published as: CN118380156B

Abstract

The application discloses a model construction method and a related device for lung nodule malignancy risk assessment. The method comprises: obtaining exhaled breath samples from subjects with benign lung nodules and subjects with malignant lung nodules; performing primary screening to obtain a differential marker set; performing secondary screening on the differential marker set by combining a random forest algorithm and recursive feature elimination with a cross-validation algorithm to obtain a diagnostic marker set; and constructing a lung nodule malignancy risk assessment model using a VOC risk scoring method based on the diagnostic marker set and the plurality of samples, so that lung nodule malignancy risk can be assessed through the model. The benign/malignant risk of lung nodules is thus analyzed from collected exhaled breath, and two rounds of screening are performed according to the correlation between exhaled metabolites and that risk, ensuring the accuracy and reliability of the malignancy risk analysis result and effectively assisting a doctor in judging whether the lung nodules of the subject under analysis are benign or malignant.

Description

Model construction method and related device for lung nodule malignancy risk assessment

Technical Field

The application relates to the technical field of computers, in particular to a model construction method and a related device for lung nodule malignancy risk assessment.

Background

Pulmonary nodules are small masses or lesions found in lung tissue, typically less than 3 cm in diameter. Determining whether a nodule is benign or malignant is of great significance for the treatment and prognosis of patients.

In recent years, noninvasive detection methods based on exhaled breath have attracted considerable attention. Exhaled breath contains many volatile organic compounds (VOCs), and particular VOCs may be related to metabolites produced by the body's metabolism, inflammatory reactions, and changes in the immune system.

Traditional pulmonary nodule diagnosis relies primarily on imaging and tissue biopsy, but these methods have limitations, such as limited accuracy in distinguishing benign from malignant nodules and the invasiveness of biopsy. There is thus currently no reliable, accurate and efficient method for assessing whether pulmonary nodules are benign or malignant.

How to assess the malignancy risk of lung nodules by collecting exhaled breath samples from people with malignant and benign lung nodules and analyzing the VOCs in the exhaled breath quantitatively and qualitatively, so as to provide reliable auxiliary analysis for diagnosing lung nodule malignancy, is therefore a problem that needs attention.

Disclosure of Invention

In view of the above problems, the present application provides a method and related apparatus for modeling a pulmonary nodule malignancy risk assessment to assess risk of pulmonary nodule malignancy, and provide reliable auxiliary analysis for pulmonary nodule malignancy diagnosis.

In order to achieve the above object, the following specific solutions are proposed:

a model building method for lung nodule malignancy risk assessment, comprising:

obtaining an exhalation of a benign subject of a lung nodule and an exhalation of a malignant subject of a lung nodule as a plurality of samples;

performing primary screening on the benign and malignant lung nodule markers based on the exhales of the benign lung nodule subjects and the exhales of the malignant lung nodule subjects to obtain a differential marker set;

Performing secondary screening on the benign and malignant lung nodule markers of the differential marker set by combining a random forest algorithm with a recursive feature elimination and a cross-validation algorithm to obtain a diagnosis marker set;

And constructing a lung nodule malignancy risk assessment model by utilizing a VOC risk scoring method based on the diagnosis marker set and the plurality of samples.

Optionally, the performing secondary screening on the benign and malignant lung nodule markers on the differential marker set by combining a random forest algorithm and a recursive feature elimination and a cross-validation algorithm to obtain a diagnostic marker set includes:

Constructing a feature set characterized by each VOC in the differential marker set;

carrying out random forest modeling through the feature set to obtain an initialized random forest model, and calculating performance scores of the random forest model on a test set of the feature set;

calculating the mean decrease in impurity of each feature in the random forest model, and removing the feature with the smallest mean decrease in impurity from the training set of the feature set to update the random forest model, wherein the test set and the training set of the feature set are alternated each time the random forest model is updated;

calculating the performance score of the updated random forest model on the test set of the feature set; if the performance score is the same as that of the random forest model before the update, determining the feature set corresponding to the updated random forest model as the diagnostic marker set; otherwise, returning to the step of calculating the mean decrease in impurity of each feature in the random forest model and removing the feature with the smallest mean decrease in impurity from the training set of the feature set to update the random forest model.

Optionally, after calculating the performance score of the updated random forest model on the test set of feature sets, the method further comprises:

If the feature quantity in the feature set corresponding to the updated random forest model reaches the preset quantity, selecting the feature set corresponding to the random forest model with the highest performance score from the random forest models updated for multiple times as a diagnosis marker set.
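The elimination loop described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `fit_and_score` is a hypothetical callback standing in for "train a random forest on the current features, alternate the training and test sets, and return the test performance score together with the per-feature importances", and `min_features` stands in for the preset feature quantity. The exact equality test is a literal reading of "the same performance score"; a tolerance could equally be used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical callback type: features -> (test score, importance per feature).
FitAndScore = Callable[[List[str]], Tuple[float, Dict[str, float]]]


def recursive_feature_elimination(
    features: List[str],
    fit_and_score: FitAndScore,
    min_features: int = 1,
) -> List[str]:
    """Remove the least important feature until the score stops changing."""
    current = list(features)
    score, importances = fit_and_score(current)
    history = [(score, list(current))]

    while len(current) > min_features:
        # Drop the feature with the smallest importance in the current model.
        weakest = min(current, key=lambda name: importances[name])
        current = [name for name in current if name != weakest]

        new_score, importances = fit_and_score(current)
        history.append((new_score, list(current)))

        if new_score == score:
            # Score unchanged by the update: accept the updated feature set.
            return current
        score = new_score

    # Preset feature quantity reached: return the best-scoring set seen so far.
    return max(history, key=lambda item: item[0])[1]
```

Passing the model fitting in as a callback keeps the sketch independent of any particular random forest library.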

Optionally, said calculating the mean decrease in impurity of each feature in said random forest model comprises:

the mean decrease in impurity of each feature in the random forest model is calculated using the formulas:

$$\mathrm{MDI}(f) = \frac{1}{N} \sum_{t} \sum_{j \in S_t(f)} \Delta I(f, j)$$

$$\Delta I(f, j) = G_{parent} - \frac{N_{left}}{N_{parent}} G_{left} - \frac{N_{right}}{N_{parent}} G_{right}$$

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^{2}$$

Wherein $\mathrm{MDI}(f)$ is the mean decrease in impurity of feature $f$, $N$ is the total number of nodes of the decision trees of the random forest model, $S_t(f)$ is the set of nodes in decision tree $t$ that are split on feature $f$, $\Delta I(f, j)$ is the impurity decrease of feature $f$ when node $j$ is split, $G_{parent}$ is the Gini impurity of the parent node, $G_{left}$ is the Gini impurity of the left child node of the parent node, $G_{right}$ is the Gini impurity of the right child node of the parent node, $N_{parent}$ is the total number of samples in the parent node, $N_{left}$ is the number of samples of the left child node, $N_{right}$ is the number of samples of the right child node, $D$ is the data set of node P, $\mathrm{Gini}(D)$ is the Gini impurity of node P, $C$ is the number of categories, and $p_i$ is the proportion of the $i$-th category in the data set.
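As a small worked illustration of the quantities defined above, the following sketch computes the Gini impurity of a node and the impurity decrease produced by one split. The function names and the toy labels are illustrative assumptions, not part of the application.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(
    parent: Sequence[str], left: Sequence[str], right: Sequence[str]
) -> float:
    """Parent impurity minus the sample-weighted impurity of both children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Toy node with two benign ("B") and two malignant ("M") samples,
# split perfectly into a pure benign child and a pure malignant child.
parent = ["B", "B", "M", "M"]
decrease = impurity_decrease(parent, ["B", "B"], ["M", "M"])
```

Summing such decreases over every node that splits on a given feature, and normalizing, gives that feature's mean decrease in impurity.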

Optionally, constructing a lung nodule malignancy risk assessment model using VOC risk scoring based on the set of diagnostic markers and the plurality of samples, comprising:

Counting the relative content of each VOC in the diagnosis marker set in each sample, and calculating a weight coefficient of each VOC affecting the accuracy of judging the malignant nodule;

Determining a high risk threshold and a low risk threshold according to the weight coefficient and the relative content of each VOC in the diagnosis marker set in each sample;

and taking the diagnosis marker set as the risk assessment analysis factors of the model, and constructing the lung nodule malignancy risk assessment model based on the weight coefficient with which each VOC in the diagnosis marker set affects the accuracy of malignant nodule determination, the high risk threshold and the low risk threshold.

Optionally, the calculating a weight coefficient of each VOC affecting the accuracy of the determination of the malignant nodule includes:

constructing a linear regression prediction model, wherein the expression function of the linear regression prediction model is as follows:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

Wherein $h_\theta(x)$ is the predicted probability value, $\theta$ is the parameter vector of the linear regression prediction model, and $x$ is the input feature vector composed of the relative content of each VOC in the diagnosis marker set in a single sample;

And taking the log likelihood loss function as an optimization target, and iteratively updating each parameter of the parameter vector of the linear regression prediction model by using a gradient descent algorithm until a preset convergence condition is reached, and determining each parameter of the parameter vector of the linear regression prediction model as a weight coefficient of each VOC which affects the determination accuracy of the malignant nodule.
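The weight fitting described above can be sketched as plain gradient descent on the log-likelihood loss. This is a minimal sketch under stated assumptions: the prediction function is taken to be the logistic (sigmoid) form, which a predicted probability combined with a log-likelihood loss suggests, and the learning rate, iteration cap and gradient-norm convergence test are illustrative choices rather than values from the application.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weights(
    X: Sequence[Sequence[float]],  # relative VOC contents, one row per sample
    y: Sequence[int],              # 1 = malignant, 0 = benign
    lr: float = 0.1,               # assumed learning rate
    max_iter: int = 5000,          # assumed iteration cap
    tol: float = 1e-6,             # assumed convergence tolerance
) -> List[float]:
    """Return one weight coefficient per VOC."""
    n, d = len(X), len(X[0])
    theta = [0.0] * d
    for _ in range(max_iter):
        # Gradient of the mean negative log-likelihood.
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            error = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
            for k in range(d):
                grad[k] += error * xi[k] / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
        if max(abs(g) for g in grad) < tol:
            break  # preset convergence condition reached
    return theta
```

Each entry of the returned vector plays the role of the weight coefficient of one VOC in the subsequent risk scoring.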

Optionally, the determining the high risk threshold and the low risk threshold according to the weight coefficient and the relative content of each VOC in the diagnostic marker set in each sample includes:

Multiplying the weight coefficient of each VOC in the diagnosis marker set in each sample by the relative content of the VOC to obtain a risk assessment score of the VOC;

for each sample, accumulating risk assessment scores of all VOCs in the diagnostic marker set in the sample to obtain a disease risk value of the sample;

Determining outliers in the disease risk values of the respective samples, the outliers including a malignant lung nodule lower bound outlier and a benign lung nodule upper bound outlier;

Rejecting all lower limit outliers of malignant lung nodules from the diseased risk values of all samples, and detecting a threshold value with highest sensitivity of the malignant lung nodules from the diseased risk values of the rest samples as a high risk threshold value;

And eliminating all benign lung nodule upper limit outliers from the diseased risk values of all samples, and detecting the threshold with highest benign lung nodule specificity from the diseased risk values of the rest samples as a low risk threshold.

Optionally, the determining an outlier in the disease risk values of the respective samples includes:

constructing a data point set of disease risk values of each sample;

Calculating the number of data points in a preset neighborhood of the data points aiming at each data point in the data point set, and determining the data point as a core point if the number of the data points in the preset neighborhood of the data point is not less than a preset minimum point number;

For each core point in the data point set, determining data points of non-core points in the preset neighborhood of the core point as boundary points;

and determining data points which are neither core points nor boundary points in the data point set as outliers, and determining a disease risk value corresponding to each outlier as an outlier.
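The core-point, boundary-point and outlier rules above follow the pattern of density-based clustering. A minimal one-dimensional sketch is given below; `eps` (the radius of the preset neighborhood) and `min_pts` (the preset minimum point number) are hypothetical parameter names, and counting a point as a member of its own neighborhood is an assumption.

```python
from typing import List, Sequence


def find_outliers(values: Sequence[float], eps: float, min_pts: int) -> List[float]:
    """Return the risk values that are neither core points nor boundary points."""
    n = len(values)
    # Neighborhood of each point: indices within eps (the point itself included).
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    # Core points have at least min_pts points in their neighborhood.
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    # Boundary points are non-core points inside some core point's neighborhood.
    border = {j for i in core for j in neighbours[i] if j not in core}
    return [values[i] for i in range(n) if i not in core and i not in border]


# Four closely spaced risk values and one isolated value.
outliers = find_outliers([1.0, 1.1, 1.2, 1.3, 5.0], eps=0.15, min_pts=3)
```

In this toy input the isolated value is the only point that lies outside every core point's neighborhood.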

Optionally, the diagnostic marker set comprises 13 VOCs, specifically 2 aromatic compounds, 2 alkanes, 3 ketones, 2 aldehydes, 1 alkene and 3 compounds of other classes, wherein,

The 2 aromatic compounds are selected from aromatic candidate marker sets, and the aromatic candidate marker sets comprise o-xylene, 1-methylnaphthalene, 3-ethyltoluene, benzene, ethylbenzene, propylbenzene, trimethylbenzene, 1-methyl-3-propylbenzene and p-xylene;

The 2 alkane compounds are selected from a set of alkane candidate markers, the set of alkane candidate markers comprising hexane, cyclohexane, 2, 4-dimethylheptane, 4-methyl octane, n-dodecane, octane, methylcyclohexane, propylcyclohexane, 2-methyl heptane, propane, butane, 2-methylpentane and pentane;

The 3 ketone compounds are selected from a ketone candidate marker set comprising 2-pentanone, 2-butanone, acetone, 2, 3-hexanedione and cyclohexanone;

The 2 aldehyde compounds are selected from an aldehyde candidate marker set, wherein the aldehyde candidate marker set comprises hexanal, nonanal, heptanal and octanal;

The 1 olefinic compound is selected from a set of olefinic candidate markers comprising isoprene, n-heptane, styrene and 1-octene.

Optionally, the process of auxiliary evaluation by the lung nodule malignant risk evaluation model includes:

Obtaining the expiration of the object to be tested;

and inputting the expiration of the object to be tested into the lung nodule malignant risk assessment model, and outputting a risk assessment result of the object to be tested.

Optionally, inputting the exhalate of the object to be tested to the lung nodule malignancy risk assessment model, and outputting a risk assessment result of the object to be tested, including:

inputting the exhalations of the object to be detected into the lung nodule malignancy risk assessment model to obtain the content value of each risk assessment analysis VOC in the exhalations of the object to be detected;

Multiplying the content value of each risk assessment analysis VOC by the weight coefficient of the risk assessment analysis VOC, which influences the malignancy nodule judgment accuracy, through the lung nodule malignancy risk assessment model to obtain the risk assessment score of the risk assessment analysis VOC;

Accumulating the risk assessment scores of all risk assessment analysis VOCs through the lung nodule malignancy risk assessment model to obtain a lung nodule benign malignancy comprehensive assessment score of the object to be detected;

and comparing the lung nodule benign and malignant comprehensive evaluation score with a high risk threshold and a low risk threshold through the lung nodule malignant risk evaluation model, and outputting a risk evaluation result of the object to be tested.
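The scoring steps above amount to a weighted sum followed by a threshold comparison, which can be sketched as follows. The VOC names, weights and thresholds in the example are made-up illustrative values, and the three-level output with inclusive comparisons is an assumption about how the two thresholds are applied.

```python
from typing import Dict


def risk_score(contents: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of (weight coefficient x content value) over the diagnostic VOCs."""
    return sum(weights[voc] * contents[voc] for voc in weights)


def assess(score: float, low_threshold: float, high_threshold: float) -> str:
    """Map a comprehensive score to a risk level (assumed three-level scheme)."""
    if score >= high_threshold:
        return "high risk"
    if score <= low_threshold:
        return "low risk"
    return "intermediate risk"


# Illustrative values only; not taken from the application.
weights = {"acetone": 0.5, "hexanal": -0.25}
contents = {"acetone": 4.0, "hexanal": 2.0}
score = risk_score(contents, weights)
result = assess(score, low_threshold=0.0, high_threshold=1.0)
```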

A model building apparatus for pulmonary nodule malignancy risk assessment, comprising:

An expired air sample acquisition unit for acquiring expired air of a benign subject of a lung nodule and expired air of a malignant subject of the lung nodule as a plurality of samples;

a differential marker screening unit, configured to perform primary screening of benign and malignant lung nodule markers based on the exhales of benign lung nodule subjects and the exhales of malignant lung nodule subjects, to obtain a differential marker set;

the diagnosis marker secondary screening unit is used for carrying out secondary screening on benign and malignant lung nodule markers on the differential marker set by combining a random forest algorithm and recursive feature elimination with a cross verification algorithm to obtain a diagnosis marker set;

And the risk assessment model construction unit is used for constructing a lung nodule malignancy risk assessment model by utilizing a VOC risk scoring method based on the diagnosis marker set and the samples.

Optionally, the diagnostic marker secondary screening unit comprises:

A feature set construction unit for constructing a feature set characterized by each VOC in the differential marker set;

The random forest modeling unit is used for carrying out random forest modeling through the feature set to obtain an initialized random forest model, and calculating the performance score of the random forest model on the test set of the feature set;

A mean decrease in impurity calculation unit for calculating the mean decrease in impurity of each feature in the random forest model;

A feature elimination unit for removing the feature with the smallest mean decrease in impurity from the training set of the feature set to update the random forest model, wherein the test set and the training set of the feature set are alternated each time the random forest model is updated;

And a performance score calculation unit for calculating the performance score of the updated random forest model on the test set of the feature set, determining the feature set corresponding to the updated random forest model as the diagnostic marker set if the performance score is the same as that of the random forest model before the update, and otherwise returning to the mean decrease in impurity calculation unit.

Optionally, the apparatus further comprises:

And the highest score feature set selecting unit is used for calculating the performance scores of the updated random forest models on the test sets of the feature sets, and selecting the feature set corresponding to the random forest model with the highest performance score from the random forest models updated for multiple times as a diagnosis marker set if the feature quantity in the feature set corresponding to the updated random forest model reaches the preset quantity.

Optionally, the mean decrease in impurity calculation unit includes:

A mean decrease in impurity calculation subunit for calculating the mean decrease in impurity of each feature in the random forest model using the formula given above for the method.

Optionally, the risk assessment model building unit includes:

A relative content statistics unit for counting the relative content of each VOC in the diagnostic marker set in each sample;

The weight coefficient calculation unit is used for calculating a weight coefficient of each VOC which affects the determination accuracy of the malignant nodule;

the threshold determining unit is used for determining a high risk threshold and a low risk threshold according to the weight coefficient and the relative content of each VOC in the diagnosis marker set in each sample;

the model construction unit is used for taking the diagnosis marker set as a model risk assessment analysis factor and constructing a lung nodule malignant risk assessment model based on a weight coefficient, the high risk threshold and the low risk threshold of each VOC in the diagnosis marker set, which influence malignant nodule judgment accuracy.

Optionally, the weight coefficient calculating unit includes:

A linear regression prediction model construction unit for constructing a linear regression prediction model whose expression function is the one given above for the method;

and the model parameter updating unit is used for iteratively updating each parameter of the parameter vector of the linear regression prediction model by using the log likelihood loss function as an optimization target and utilizing a gradient descent algorithm until a preset convergence condition is reached, and determining each parameter of the parameter vector of the linear regression prediction model as a weight coefficient of each VOC which affects the determination accuracy of the malignant nodule.

Optionally, the threshold determining unit includes:

A risk assessment score calculation unit, configured to multiply, for each sample, the weight coefficient of each VOC in the diagnostic marker set in the sample by the relative content of the VOC to obtain a risk assessment score of the VOC;

The disease risk value calculation unit is used for accumulating the risk assessment scores of all the VOCs in the diagnosis marker set in each sample to obtain a disease risk value of the sample;

an outlier determination unit for determining outliers in the disease risk values of the respective samples, the outliers comprising a malignant lung nodule lower bound outlier and a benign lung nodule upper bound outlier;

The high risk threshold determining unit is used for eliminating all lower limit outliers of malignant lung nodules from the diseased risk values of all samples, and detecting a threshold with highest sensitivity of the malignant lung nodules in the diseased risk values of the rest samples as a high risk threshold;

And the low risk threshold determining unit is used for eliminating all upper limit outliers of benign lung nodules from the diseased risk values of all samples, and detecting the threshold with highest specificity of the benign lung nodules from the diseased risk values of the rest samples as a low risk threshold.

Optionally, the outlier determining unit includes:

the data point set construction unit is used for constructing a data point set of the disease risk value of each sample;

the core point determining unit is used for calculating the number of data points in a preset neighborhood of the data points according to each data point in the data point set, and determining the data point as a core point if the number of the data points in the preset neighborhood of the data point is not smaller than a preset minimum point;

A boundary point determining unit, configured to determine, for each core point in the data point set, a data point of a non-core point in the preset neighborhood of the core point as a boundary point;

And the outlier determining unit is used for determining that the data points which are neither core points nor boundary points in the data point set are outliers, and determining the disease risk value corresponding to each outlier as an outlier.

Optionally, the apparatus further comprises:

An exhaled breath acquisition unit for the object to be tested, used for acquiring the exhaled breath of the object to be tested;

And the model output unit is used for inputting the expiration of the object to be tested to the lung nodule malignancy risk assessment model and outputting a risk assessment result of the object to be tested.

Optionally, the model output unit includes:

The model output first subunit is used for inputting the expired air of the object to be tested into the lung nodule malignant risk assessment model to obtain the content value of each risk assessment analysis VOC in the expired air of the object to be tested;

The model output second subunit is used for multiplying the content value of each risk assessment analysis VOC by the weight coefficient of the risk assessment analysis VOC, which influences the malignancy nodule judgment accuracy, through the lung nodule malignancy risk assessment model to obtain the risk assessment score of the risk assessment analysis VOC;

The model output third subunit is used for accumulating the risk assessment scores of all the risk assessment analysis VOCs through the lung nodule malignancy risk assessment model to obtain a lung nodule benign and malignancy comprehensive assessment score of the object to be tested;

And the model output fourth subunit is used for comparing the lung nodule benign and malignant comprehensive evaluation score with a high risk threshold and a low risk threshold through the lung nodule malignant risk evaluation model and outputting a risk evaluation result of the object to be tested.

A model building device for lung nodule malignancy risk assessment, comprising a memory and a processor;

The memory is used for storing programs;

The processor is configured to execute the program to implement the steps of the model building method for lung nodule malignancy risk assessment as described above.

A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a model building method for pulmonary nodule malignancy risk assessment as described above.

By means of the above technical scheme, exhaled breath samples are obtained from subjects with benign lung nodules and subjects with malignant lung nodules, and primary screening is performed to obtain a differential marker set. Secondary screening of benign and malignant lung nodule markers is then performed on the differential marker set by combining a random forest algorithm and recursive feature elimination with a cross-validation algorithm to obtain a diagnostic marker set. Finally, a lung nodule malignancy risk assessment model is constructed using a VOC risk scoring method based on the diagnostic marker set and the plurality of samples, so that lung nodule malignancy risk can be assessed through the model. The benign/malignant risk of lung nodules is thus analyzed from collected exhaled breath, and two rounds of screening are performed according to the correlation between exhaled metabolites and that risk, ensuring the accuracy and reliability of the malignancy risk analysis result and effectively assisting a doctor in judging whether the lung nodules of the subject under analysis are benign or malignant.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a schematic flow chart of a model construction for achieving lung nodule malignancy risk assessment according to an embodiment of the present application;

FIG. 2 is a diagram of a spectrogram with corrected data according to an embodiment of the present application;

FIG. 3 is a graph of OPLS-DA distribution provided by an embodiment of the application;

FIG. 4 is a schematic diagram illustrating a high/low risk threshold setting according to an embodiment of the present application;

FIG. 5 is a ROC curve of a lung nodule malignancy risk assessment model provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a device for model construction for achieving lung nodule malignancy risk assessment according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a device for model construction for achieving lung nodule malignancy risk assessment according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The inventive solution may be implemented on the basis of a terminal with data processing capabilities, which may be a lung nodule benign and malignant diagnostic analysis system, which may be equipped with an exhalation acquisition device.

Next, as shown in connection with fig. 1, the model construction method of lung nodule malignancy risk assessment of the present application may include the steps of:

Step S110, obtaining exhalations of benign subjects of lung nodules and exhalations of malignant subjects of lung nodules as a plurality of samples.

Specifically, the lung nodule benign and malignant diagnosis analysis system can collect the exhaled breath of a target subject through the exhalation collection device. A subject with a benign or malignant lung nodule may be instructed to exhale into the exhalation collection device, so that the exhaled breath is collected by the system.

Wherein the exhalation of each subject with benign/malignant lung nodules can be taken as a sample.

Step S120, performing primary screening of benign and malignant lung nodule markers based on the exhaled breath of the subjects with benign lung nodules and the exhaled breath of the subjects with malignant lung nodules to obtain a differential marker set.

Specifically, the primary screening may include:

S1201, detecting the exhaled breath of the subjects with benign lung nodules and the exhaled breath of the subjects with malignant lung nodules using gas chromatography combined with mass spectrometry to obtain a first exhaled compound spectrogram of the subjects with benign lung nodules and a second exhaled compound spectrogram of the subjects with malignant lung nodules.

Specifically, the exhaled breath of the benign subject of the lung nodule and the exhaled breath of the malignant subject of the lung nodule can be detected by using a gas chromatograph in combination with mass spectrometry to obtain a first compound spectrogram signal of the benign subject of the lung nodule and a second compound spectrogram signal of the malignant subject of the lung nodule, and then noise of the first compound spectrogram signal and the second compound spectrogram signal is removed to obtain a first compound spectrogram signal after noise removal and a second compound spectrogram signal after noise removal.

It will be appreciated that in order to reduce the effects of instrument detection fluctuations, environmental disturbances and human error on the spectrogram, and to improve the stability and effectiveness of the data analysis results, the spectrogram data may be pre-processed prior to the start of the data analysis, and the pre-processing may include noise removal and baseline correction.

Specifically, for each compound spectrogram signal among the first compound spectrogram signal and the second compound spectrogram signal, if a target signal point exists in the signal, the signal value of the target signal point is replaced with the mean of the signal values of its two adjacent signal points to obtain the compound spectrogram signal after noise removal.

Here, a target signal point is a point whose signal value is greater than the preset signal ratio times the signal value of its adjacent signal points.

It can be understood that spikes may exist in a compound spectrogram signal. A spike is a sharp peak caused by noise and appears as a triangular peak formed by three consecutive points on the spectrogram signal. A threshold ratio can therefore be set as the preset signal ratio: when the ratio of the signal value of a point to the signal values of its two adjacent points is detected to exceed this threshold, the point is treated as sampling noise, and the spike is eliminated by replacing the abnormal point with the mean of the signals of the two adjacent points.
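The spike rule described above can be sketched in a few lines. This is a minimal sketch assuming non-negative signal values and assuming that a point must exceed both of its neighbors by the preset ratio to count as a spike; the parameter name `ratio` is illustrative.

```python
from typing import List, Sequence


def remove_spikes(signal: Sequence[float], ratio: float) -> List[float]:
    """Replace single-point spikes with the mean of their two neighbors."""
    cleaned = list(signal)
    for i in range(1, len(signal) - 1):
        left, right = signal[i - 1], signal[i + 1]
        # A spike exceeds both neighbors by more than the preset ratio.
        if signal[i] > ratio * left and signal[i] > ratio * right:
            cleaned[i] = (left + right) / 2.0
    return cleaned


# The isolated value 10.0 is treated as sampling noise; the broad peak is kept.
despiked = remove_spikes([1.0, 1.0, 10.0, 1.0, 1.0], ratio=3.0)
unchanged = remove_spikes([1.0, 2.0, 3.0, 2.0, 1.0], ratio=3.0)
```

Neighbors are read from the original signal rather than the cleaned copy, so one replacement never influences the test applied to the next point.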

Further, baseline calibration is performed on the first compound spectrogram signal after noise removal and the second compound spectrogram signal after noise removal, so as to obtain the first compound spectrogram signal after baseline calibration and the second compound spectrogram signal after baseline calibration.

It will be appreciated that, owing to the instrument itself, the baseline of the spectrogram data typically drifts gradually away from 0, which affects peak detection and quantification. The baseline drift may therefore be fitted with a curve, and the fitted curve used for calibration.

Specifically, the baseline calibration process for compound spectrogram signals may include:

For each of the noise-removed first and second compound spectrogram signals, signal points at non-peak positions in the compound spectrogram signal are selected, and a baseline calibration curve for the compound spectrogram signal is constructed from those signal points using the formula:

$$B(t) = \sum_{n=0}^{N} a_n t^n$$

Wherein, $B(t)$ is the baseline calibration curve giving the baseline value of the compound spectrogram signal over time $t$, $N$ is the highest polynomial order of the baseline calibration curve, and $a_n$ is the n-th coefficient, which can be fitted to the compound spectrogram signal by a least squares method.

The baseline calibration curve of each compound spectrogram signal in the noise-removed first and second compound spectrogram signals is then subtracted from that signal to obtain the baseline-calibrated compound spectrogram signal.
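A minimal sketch of the baseline calibration is given below, assuming NumPy is available. The function name `baseline_correct`, the second-order polynomial, the synthetic drift and the way non-peak positions are marked (a boolean mask) are all assumptions made for illustration; the text does not specify how non-peak positions are chosen.

```python
import numpy as np


def baseline_correct(t, signal, non_peak_mask, order=2):
    """Fit a polynomial baseline on non-peak points and subtract it."""
    t = np.asarray(t, dtype=float)
    signal = np.asarray(signal, dtype=float)
    mask = np.asarray(non_peak_mask, dtype=bool)
    # Least-squares fit of the polynomial coefficients on non-peak points only.
    coeffs = np.polyfit(t[mask], signal[mask], deg=order)
    baseline = np.polyval(coeffs, t)
    return signal - baseline, baseline


# Synthetic example: quadratic drift plus one Gaussian peak at t = 5 min.
t = np.linspace(0.0, 10.0, 101)
drift = 0.5 + 0.2 * t + 0.01 * t**2
peak = 5.0 * np.exp(-((t - 5.0) ** 2) / 0.1)
mask = np.abs(t - 5.0) > 1.0  # hypothetical choice of non-peak region
corrected, baseline = baseline_correct(t, drift + peak, mask, order=2)
```

After subtraction the non-peak region sits close to 0, while the peak height is essentially preserved.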

Still further, a first expiratory compound spectrogram of the benign lung nodule subjects is plotted based on the baseline-calibrated first compound spectrogram signal, and a second expiratory compound spectrogram of the malignant lung nodule subjects is plotted based on the baseline-calibrated second compound spectrogram signal.

Specifically, the corrected spectrogram data is plotted as shown in fig. 2, where each color represents one expiratory chromatogram, the abscissa represents the retention time (in minutes), and the ordinate represents the chromatographic signal response value. The plotted spectrogram makes it convenient to check whether the data has problems, whether the data correction was completed as expected, and whether the requirements of subsequent data processing can be met.

S1202, carrying out normalization treatment, peak elimination treatment, peak filling treatment and Z-score treatment on the expiratory compound spectrogram aiming at each expiratory compound spectrogram in the first expiratory compound spectrogram and the second expiratory compound spectrogram to obtain a standardized expiratory compound spectrogram.

Specifically, the normalization of the expiratory compound spectrogram may proceed as follows:

The compounds present in the sample and their relative contents are determined by analyzing features such as peak shape, peak height and peak area in the expiratory compound spectrogram. The peak processing may comprise identification, fitting and extraction.

Peak area normalization is performed on each VOC on a per-sample basis, converting each peak area into a relative content value and thereby obtaining a normalized expiratory compound spectrogram. The normalization formula is:

$$P_i = \frac{A_i}{\sum_{j=1}^{N} A_j}$$

Wherein, $A_i$ is the peak area of the i-th VOC, $P_i$ is the ratio of the peak area of the i-th VOC to the total peak area, and $N$ is the number of all VOCs.
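For illustration, the per-sample peak area normalization may be sketched as below. The function name and the three peak areas are hypothetical.

```python
def normalize_peak_areas(areas):
    """Convert the peak areas of one sample into relative content values."""
    total = sum(areas)
    if total == 0:
        raise ValueError("sample has no detected peak area")
    # Each VOC's share of the total peak area of the sample.
    return [a / total for a in areas]


# Made-up peak areas of three VOCs in a single sample.
sample_areas = [120.0, 30.0, 50.0]
relative = normalize_peak_areas(sample_areas)
```

The relative contents of one sample always sum to 1, which makes samples with different overall signal intensity comparable.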

Further, a mass spectrum information lookup table can be used to find the metabolite information corresponding to each retention time, completing the qualitative analysis of the VOCs. After the qualitative analysis is completed, the KEGG (Kyoto Encyclopedia of Genes and Genomes) database is queried for annotation information on the metabolites, such as common database names, classification information and the pathways in which they participate.

The peak elimination process for the normalized expiratory compound spectrogram can be as follows:

Compound peaks whose peak areas are missing in most samples are removed from the expiratory compound spectrogram to obtain the expiratory compound spectrogram after the peak elimination treatment.

Specifically, a preset missing-proportion threshold can be set, and compound peaks whose peak area is missing in more than this proportion of all samples are removed.

It will be appreciated that due to differences in subject and sampling environments, there is also a difference in the metabolite peaks detected for each sample, and that some compounds/metabolites may not be identified by the algorithm due to their too low concentration or severe interference from background noise, ultimately resulting in zero peak area. Such peaks have area values in only a small portion of the samples and most of the sample values are missing. When the value of this peak is missing in most samples, it may lead to errors or failures in the data analysis method, and therefore such peaks need to be rejected.
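The peak elimination may be sketched as follows, assuming that a missing peak is stored as a peak area of 0. The function name, the threshold of 0.5 and the peak table are hypothetical.

```python
def drop_sparse_peaks(peak_table, max_missing_ratio=0.5):
    """Keep only peaks whose missing proportion does not exceed the threshold.

    `peak_table` maps a peak name to its peak areas across all samples;
    an area of 0 is treated as a missing value (assumption).
    """
    kept = {}
    for peak, areas in peak_table.items():
        missing = sum(1 for a in areas if a == 0)
        if missing / len(areas) <= max_missing_ratio:
            kept[peak] = areas
    return kept


# Made-up table with four samples per peak.
table = {
    "peak_A": [1.2, 0.0, 0.9, 1.1],  # missing in 25% of samples
    "peak_B": [0.0, 0.0, 0.0, 0.4],  # missing in 75% of samples
    "peak_C": [2.0, 2.2, 1.9, 2.1],  # never missing
}
filtered = drop_sparse_peaks(table, max_missing_ratio=0.5)
```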

The process of performing peak filling treatment on the expiratory compound spectrogram after the peak elimination treatment may be:

The missing compound peaks of each sample in the expiratory compound spectrogram are filled to obtain the expiratory compound spectrogram after the peak filling treatment.

Wherein, a missing compound peak is a peak whose peak area value is 0 in at least one sample, while the number of samples in which it is missing does not exceed the preset missing-proportion threshold of all samples.

It can be appreciated that although the peaks with many missing values have been removed, the remaining peaks still contain some missing values, which need to be filled; otherwise the accuracy of the subsequent analysis results is affected. Specifically, each missing value can be filled with 1/2 of the minimum non-zero value in the group, so that the difference between the groups is preserved.
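A minimal sketch of the half-minimum filling follows. It assumes that "group" means the benign or the malignant group and that the filling is done separately for each peak within each group; the function name and the values are hypothetical.

```python
def fill_missing_half_min(areas):
    """Fill zero (missing) areas of one peak within one group.

    Each missing value is replaced by half of the smallest non-zero
    area observed for that peak in the same group.
    """
    nonzero = [a for a in areas if a > 0]
    if not nonzero:
        return list(areas)  # nothing to derive a fill value from
    fill = min(nonzero) / 2.0
    return [a if a > 0 else fill for a in areas]


# Made-up areas of one peak in the two groups.
benign = [0.8, 0.0, 1.0, 0.6]
malignant = [2.0, 2.4, 0.0, 1.8]
benign_filled = fill_missing_half_min(benign)
malignant_filled = fill_missing_half_min(malignant)
```

Filling per group keeps the filled value below every observed value of that group, so the filling does not blur the gap between groups.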

The Z-score treatment of the peak-filled expiratory compound profile may be:

Z-score processing is performed on the data along the peak dimension, so that the data is rescaled to a mean of 0 and a standard deviation of 1, as in a standard normal distribution. The purpose is to give every peak a consistent weight in the subsequent modeling process and to eliminate the bias caused by differences in order of magnitude.

Wherein, the Z-score formula is:

$$z_i = \frac{x_i - \mu}{\sigma}$$

Wherein, $\mu$ is the mean of the data, $\sigma$ is the standard deviation of the data, $x_i$ is the peak area of the i-th VOC after the peak filling treatment, and $z_i$ is the peak area of the VOC after Z-score processing in the feature dimension.
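The Z-score processing of one peak across samples may be sketched as below. Using the population standard deviation is an assumption, since the text does not state which estimator is used; the function name and values are hypothetical.

```python
import statistics


def z_score(values):
    """Standardize the values of one peak across samples."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in values]  # constant peak carries no information
    return [(v - mu) / sigma for v in values]


# Made-up peak areas of one VOC in eight samples (mean 5, standard deviation 2).
areas = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_score(areas)
```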

S1203, calculating a false positive discovery rate FDR value and a variable projection importance index VIP value of each compound peak in the standardized expiratory compound spectrogram, determining compounds corresponding to compound peaks with FDR values smaller than a preset FDR threshold and VIP values larger than a preset VIP threshold in the standardized expiratory compound spectrogram as differential markers, and combining the differential markers to obtain a differential marker set.

Specifically, for each compound peak in the normalized exhaled compound spectrogram, the p value and VIP value of the compound peak are determined by comparing the peak of the compound peak in the normalized first exhaled compound spectrogram with the peak of the compound peak in the normalized second exhaled compound spectrogram, and the p value is converted to obtain the FDR value of the compound peak.

It will be appreciated that the VOC matrix data is continuous numerical data and that the data distribution is random. Univariate analysis is therefore performed with a combination of the independent t-test and the rank sum test: if the data of a metabolite satisfies a normal distribution, the P value is calculated using the independent t-test; otherwise, the rank sum test is used. Further, the P value is converted to an FDR value using the BH (Benjamini-Hochberg) method, which reduces the number of false positive differential metabolites.

The FDR value under the BH method is calculated by the following formula:

$$\mathrm{FDR} = \frac{p \times m}{k}$$

Wherein, $p$ is the p value of the current test, $m$ is the number of tests, and $k$ is the rank of the p value of the current test among all tests, in ascending order.
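The BH conversion may be sketched as follows. In addition to the p × m / k scaling, the sketch enforces that the adjusted values do not decrease with rank, which is the usual convention for BH-adjusted values but is not stated in the text; the function name and the p values are hypothetical.

```python
def bh_fdr(p_values):
    """Benjamini-Hochberg adjusted values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending ranks
    fdr = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        raw = p_values[idx] * m / rank
        running_min = min(running_min, raw)
        fdr[idx] = running_min
    return fdr


# Made-up p values of four compound peaks.
p = [0.01, 0.04, 0.03, 0.20]
q = bh_fdr(p)
```

Here the peak with p = 0.03 (rank 2) has a raw value of 0.06, which is lowered to 0.0533, the value of the rank-3 peak, by the monotonicity step.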

In calculating VIP values, multivariate analysis may be performed on the data using OPLS-DA (Orthogonal Projections to Latent Structures Discriminant Analysis), and the difference between the two groups may be presented visually using an OPLS-DA scatter plot, as shown in fig. 3. The abscissa t[1]P in fig. 3 represents the predictive principal component score of the first principal component and shows the differences between sample groups; the ordinate t[1]O represents the orthogonal principal component score and shows the differences within sample groups. Each scatter point represents one sample, and the color of a point indicates its experimental group: red is a malignant nodule sample and blue is a benign nodule sample. The farther apart the samples are horizontally, the larger the difference between the groups; the closer they are vertically, the better the repeatability within a group. The red and blue elliptical shadows are the 95% confidence ellipses of the respective groups, which can be understood as the distribution space of one sample group, and samples outside an elliptical shadow can be understood as the 5% outliers.

A regression model is established by using OPLS, and the linear relation between X and Y is modeled, wherein X represents a VOC characteristic matrix, the rows of the matrix represent samples, the columns of the matrix represent VOC, and Y represents classification results. In the OPLS model, X and Y are projected into the principal component and residual space using feature extraction. The objective of OPLS is to extract the main correlation information from X and separate it from the uncorrelated information. The OPLS model is extended to classification problems using orthogonal decomposition, and the model is converted into a discriminant model by introducing class information.

Wherein, the matrix decomposition of X is:

$$X = T P^{T} + F$$

Wherein, X is the predictor variable matrix, T is the score matrix of X, $P$ is the loading matrix of X, and F is the residual matrix of X.

The matrix decomposition of Y is:

$$Y = T Q^{T} + G$$

Wherein, Y is the response variable matrix, T is the score matrix of X, $Q$ is the loading matrix of Y, and G is the residual matrix of Y.

The orthogonalization process extracts the main relevant information:

Wherein, For the orthogonalization of X the score matrix,Is an orthogonalization score matrix of Y.

The discrimination model of OPLS-DA is:

$$Y_o = XB + E$$

Wherein, $Y_o$ is the orthogonalized score matrix of Y, X is the input variable matrix, B is the regression coefficient matrix, which represents the weight coefficients of the prediction model, and E is the residual matrix.

Further, differential metabolite screening was performed using FDR generated by univariate analysis and VIP values generated by multivariate analysis.

The calculation formula of the VIP value is as follows:

Wherein, VIP values representing the jth predicted variable,In order to determine the amount of the compound,The weight of the i-th compoundFrom the matrix of weight coefficients B,The coefficient representing the jth predicted variable in the ith compound.

Further, after calculating the FDR value and VIP of each compound peak, the determination of the differential markers may be performed by setting the FDR threshold and VIP threshold.

Specifically, the preset FDR threshold and the preset VIP threshold may be customized, for example, if the preset FDR threshold is 0.05 and the preset VIP threshold is 1, then all compounds with FDR values less than 0.05 and VIP values greater than 1 may be determined as differential markers.
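The threshold-based determination of differential markers may be sketched as follows. The peak names and their FDR and VIP values are made up for illustration and are not results from this application; only the default thresholds 0.05 and 1 come from the example above.

```python
def select_differential_markers(stats, fdr_threshold=0.05, vip_threshold=1.0):
    """Return the peaks with FDR below and VIP above the preset thresholds.

    `stats` maps a peak name to an (FDR value, VIP value) pair.
    """
    return [
        name
        for name, (fdr, vip) in stats.items()
        if fdr < fdr_threshold and vip > vip_threshold
    ]


# Hypothetical statistics for four compound peaks.
peak_stats = {
    "voc_1": (0.010, 1.6),  # passes both criteria
    "voc_2": (0.200, 1.4),  # FDR too high
    "voc_3": (0.030, 0.7),  # VIP too low
    "voc_4": (0.004, 1.2),  # passes both criteria
}
markers = select_differential_markers(peak_stats)
```

Both conditions must hold at once, so a peak that is statistically significant but contributes little to the OPLS-DA model (or the reverse) is not retained.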

Wherein, the various VOCs included in the differential marker set are shown in Table 1:

Table 1 various VOCs included in the differential marker set

As can be seen from table 1, based on the exhalation data of patients with malignant lung nodules and patients with benign nodules, 54 potential diagnostic markers for benign and malignant lung nodules were obtained. Statistical analysis found that these VOCs differ significantly between malignant lung nodule patients and benign nodule patients (FDR < 0.05); the difference may be an increased level, a decreased level, or the absence of a VOC compared to controls.

These VOCs can be broadly divided into several classes, including alkanes, aromatics, organosulfides, ketones, aldehydes, and the like. Alkanes account for the highest proportion and include hexane, 3-ethylhexane, 4-methyloctane and the like. The aromatic hydrocarbons include o-xylene, 3-ethyltoluene, propylbenzene, 1-methylnaphthalene and the like. In addition, there are ketones such as 2-pentanone, 2-butanone and cyclohexanone; aldehydes such as hexanal, nonanal and heptanal; alcohols such as 1-propanol and isopropanol; the organosulfur compound 1-(methylthio)propane; and other types of compounds such as 2,5-dimethylfuran, acetonitrile and ethyl 4-ethoxybenzoate.

Analysis shows that the concentration of one group of VOC molecules, including nonanal, hexanal, 2-pentanone, 1-propanol and the like, is significantly higher in the malignant lung nodule group than in the benign lung nodule group, which indicates that changes in certain metabolic pathways in the malignant group raise the concentration of these VOC molecules in the human body. In addition, another group of VOC molecules, including acetonitrile, isoprene, 2-butanone and the like, was found to be significantly lower in the malignant lung nodule group than in the benign lung nodule group, which indicates that changes in certain metabolic pathways in the malignant group lower the concentration of these VOC molecules in the human body.

Further data analysis showed a correlation between the concentration of these VOC molecules and the extent of development of the lung nodules. In particular, the concentrations of hexane and hexanal are positively correlated with the size of the lung nodules, while the concentrations of acetonitrile and 2-butanone are negatively correlated with the size of the lung nodules. These results indicate that VOC molecules in exhaled breath can serve as biomarkers of the occurrence and progression of lung nodules, to predict the occurrence and severity of disease.

Step 130, performing secondary screening of benign and malignant lung nodule markers on the differential marker set by combining a random forest algorithm with recursive feature elimination and cross-validation, to obtain a diagnosis marker set.

Specifically, recursive feature elimination with cross-validation (RFECV) is a greedy optimization algorithm that aims to find the best-performing feature subset.

It can be appreciated that, since the differential marker set obtained by the primary screening contains many VOCs, a secondary screening of benign and malignant lung nodule markers can be performed on the basis of the differential marker set to reduce the number of differential metabolites and obtain the markers most strongly related to benign and malignant lung nodules, so that the risk can be evaluated more accurately.

Step 140, constructing a lung nodule malignancy risk assessment model by using a VOC risk scoring method based on the diagnosis marker set and the plurality of samples.

According to the model construction method for lung nodule malignancy risk assessment, exhalate samples of benign and malignant lung nodule subjects are obtained and subjected to primary screening to obtain a differential marker set. Secondary screening of benign and malignant lung nodule markers is then performed on the differential marker set by combining a random forest algorithm with recursive feature elimination and cross-validation, to obtain a diagnosis marker set. On the basis of the diagnosis marker set and the plurality of samples, a lung nodule malignancy risk assessment model is constructed by using a VOC risk scoring method, so that lung nodule malignancy risk assessment can be performed through the model. The benign/malignant risk of lung nodules is thus analyzed by collecting exhaled breath, and two rounds of screening are carried out according to the correlation between exhaled metabolites and that risk, which helps ensure the accuracy and reliability of the analysis result and effectively assists a doctor in judging whether the lung nodules of the subject under analysis are benign or malignant.

In some embodiments of the present application, the process in step 130 of performing secondary screening of benign and malignant lung nodule markers on the differential marker set by combining the random forest algorithm with recursive feature elimination and cross-validation, to obtain the diagnosis marker set, is described. The process may include:

S1, constructing a feature set taking each VOC in the differential marker set as a feature.

S2, carrying out random forest modeling through the feature set to obtain an initialized random forest model, and calculating performance scores of the random forest model on a test set of the feature set.

Specifically, the performance score of the random forest model may be an Accuracy value, and the calculation mode may be the number of correctly classified samples divided by the total number of samples.

S3, calculating the mean decrease in impurity of each feature in the random forest model, and eliminating the feature with the smallest mean decrease in impurity from the training set of the feature set so as to update the random forest model.

Each time the random forest model is updated, the test set and the training set of the feature set are rotated once.

It will be appreciated that, to evaluate the performance of the model after feature removal, RFECV uses cross-validation: in each round the dataset is divided into K equal parts, one of which is used as test data while the other K-1 parts are used as training data, and the rounds then proceed in turn. RFECV performs feature elimination on the training set and evaluates model performance on the test set. This process is repeated, each time using a different training set and test set.
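The rotation of training and test data in K-fold cross-validation may be sketched as follows. The interleaved assignment of samples to folds (no shuffling, no stratification by class) and the function name are assumptions made to keep the example short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the K rounds."""
    # Interleaved assignment: sample i goes to fold i % k (assumption).
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten samples split into five folds: each round tests on two samples.
splits = list(k_fold_indices(10, 5))
```

Every sample is used as test data exactly once across the K rounds and as training data in the remaining K-1 rounds.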

Specifically, the mean decrease in impurity may be used as the importance score of each feature, measuring the contribution of the feature to the performance of the random forest model, so that at each update the least important feature can be selected for elimination according to its importance score.

Wherein the process of calculating the mean decrease in impurity for each feature in the random forest model may include:

S4, calculating the performance score of the updated random forest model on the test set of the feature set, if the performance score is the same as the performance score of the random forest model before updating, determining the feature set corresponding to the updated random forest model as a diagnosis marker set, otherwise, returning to the step S3.

It will be appreciated that, because each update removes the feature with the smallest mean decrease in impurity, the performance score of the random forest model is expected to rise, or at least not fall. If the performance score no longer increases, that is, it is the same as the score before the update, the random forest model can be considered to have reached its optimal state, and the feature set corresponding to the updated random forest model can be determined as the diagnosis marker set.

In addition, the condition for terminating the update may be the feature quantity.

Specifically, after calculating the performance scores of the updated random forest models on the test set of the feature sets, if the feature numbers in the feature sets corresponding to the updated random forest models reach the preset number, selecting the feature set corresponding to the random forest model with the highest performance score from the random forest models updated for multiple times, and taking the feature set as a diagnosis marker set.
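The elimination loop with the feature-quantity termination condition may be sketched generically as below. The scoring and importance functions are passed in as callables; in practice they would be the cross-validated Accuracy of a random forest and its impurity-based feature importances, but here they are replaced by toy stand-ins so that the sketch stays self-contained. All names and numbers are hypothetical, and preferring the smaller subset on a tie is an assumption.

```python
def recursive_feature_elimination(features, score_fn, importance_fn, min_features=1):
    """Drop the least important feature repeatedly; return the best subset seen."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # ties favour the smaller subset (assumption)
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins for a trained random forest (made-up importances).
TOY_IMPORTANCE = {"voc_a": 0.50, "voc_b": 0.30, "voc_c": 0.02, "voc_d": 0.01}


def toy_importance(subset):
    return {f: TOY_IMPORTANCE[f] for f in subset}


def toy_score(subset):
    # Informative features add to the score, noisy ones cost a small penalty.
    signal = sum(TOY_IMPORTANCE[f] for f in subset if TOY_IMPORTANCE[f] > 0.1)
    noise_penalty = 0.05 * sum(1 for f in subset if TOY_IMPORTANCE[f] <= 0.1)
    return signal - noise_penalty


subset, score = recursive_feature_elimination(
    list(TOY_IMPORTANCE), toy_score, toy_importance, min_features=1
)
```

In the toy run the two noisy features are removed first and the score rises; removing a third feature lowers the score, so the two-feature subset is returned.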

According to the model construction method for lung nodule malignancy risk assessment, secondary screening of benign and malignant lung nodule markers is performed on the differential marker set by combining a random forest algorithm with recursive feature elimination and cross-validation. The least important feature is removed each time the random forest model is updated, so that the remaining features are more critical and the resulting VOCs are more important for lung nodule malignancy risk assessment.

In some embodiments of the present application, the process in step 140 of constructing the lung nodule malignancy risk assessment model by using a VOC risk scoring method based on the diagnosis marker set and the plurality of samples may include:

S1, counting the relative content of each VOC in the diagnosis marker set in each sample, and calculating a weight coefficient of each VOC affecting the accuracy of judging the malignant nodule.

Wherein the relative content of each VOC in the set of diagnostic markers in each sample may be a normalized relative content of VOC.

Specifically, logistic regression can be used for the correlation analysis, with the normalized relative content of each VOC as the independent variable and whether the nodule is malignant as the dependent variable, so as to obtain a weight coefficient relating each VOC to the disease. The weight coefficient represents the degree to which the VOC influences the accuracy of malignant nodule determination.

The specific process of calculating the weight coefficient of each VOC that affects the accuracy of the malignant nodule determination may include:

S11, constructing a linear regression prediction model.

Wherein, the expression function of the linear regression prediction model is:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$$

Wherein, $h_\theta(x)$ is the predicted probability value, $\theta$ is the parameter vector of the linear regression prediction model, and $x$ is the input feature vector. The input feature vector may consist of the relative content of each VOC in the diagnosis marker set in a single sample.

Specifically, a hypothesis function is used to establish the relationship between the input features and the output labels. The hypothesis function takes the form of a linear regression: the input features are summed with weights, and the result is converted into a probability value by a logistic function.

S12, taking the log likelihood loss function as an optimization target, and iteratively updating each parameter of the parameter vector of the linear regression prediction model by using a gradient descent algorithm until a preset convergence condition is reached, and determining each parameter of the parameter vector of the linear regression prediction model as a weight coefficient of each VOC which affects the determination accuracy of the malignant nodule.

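In particular, to evaluate the predictive effect of the model, logistic regression may use a log-likelihood loss function as the optimization objective. The log-likelihood loss function is as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta\left(x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta\left(x^{(i)}\right)\right) \right]$$

Wherein, $J(\theta)$ represents the loss function, $m$ represents the number of samples, $y^{(i)}$ represents the true label of the i-th sample, $x^{(i)}$ represents the feature vector of the i-th sample, and $h_\theta(x^{(i)})$ is the predicted probability value for that sample.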

It will be appreciated that, in order to calculate the weight coefficient of each feature, a gradient descent algorithm may be used for optimization. The gradient descent algorithm minimizes the loss function by iteratively updating the parameters. The parameter update formula of the gradient descent algorithm is as follows:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

Wherein, $\theta_j$ is the j-th parameter, $\alpha$ is the learning rate, and $\partial J(\theta) / \partial \theta_j$ is the partial derivative of the loss function with respect to the parameter $\theta_j$.

Specifically, the preset convergence condition may be reaching a preset maximum number of iterations, or the modulus (norm) of the gradient falling below a certain threshold (e.g., a gradient magnitude smaller than 1e-6).
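Steps S11 and S12 may be sketched in plain Python as follows. The learning rate, the iteration limit, the constant first column used as an intercept and the six toy samples are assumptions made for illustration, not values from this application.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(theta, row):
    """Predicted probability for one feature vector."""
    return sigmoid(sum(t * v for t, v in zip(theta, row)))


def log_loss(theta, X, y, eps=1e-12):
    """Log-likelihood loss averaged over the samples."""
    total = 0.0
    for row, label in zip(X, y):
        p = predict(theta, row)
        total += label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return -total / len(X)


def fit(X, y, lr=0.5, max_iter=5000, tol=1e-6):
    """Gradient descent with both convergence conditions from the text."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(max_iter):  # condition 1: maximum number of iterations
        errors = [predict(theta, row) - label for row, label in zip(X, y)]
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(len(theta))]
        if math.sqrt(sum(g * g for g in grad)) < tol:  # condition 2: small gradient
            break
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data: column 0 is a constant 1.0 (intercept, an assumption),
# column 1 is the relative content of one hypothetical VOC.
X = [[1.0, 0.10], [1.0, 0.40], [1.0, 0.35], [1.0, 0.80], [1.0, 0.60], [1.0, 0.90]]
y = [0, 0, 1, 1, 0, 1]
theta = fit(X, y)
```

The fitted `theta[1]` plays the role of the weight coefficient of the VOC: a positive value means that a higher relative content pushes the predicted probability towards the malignant class.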

S2, determining a high risk threshold and a low risk threshold according to the weight coefficient and the relative content of each VOC in the diagnosis marker set in each sample.

Specifically, the process of determining the high risk threshold and the low risk threshold according to the weight coefficient and the relative content of each VOC in the set of diagnostic markers in each sample may include:

S21, multiplying the weight coefficient of each VOC in the diagnosis marker set in each sample by the relative content of the VOC to obtain the risk assessment score of the VOC.

S22, for each sample, accumulating risk assessment scores of all VOCs in the diagnosis marker set in the sample to obtain a disease risk value of the sample.

Specifically, the disease risk value of the sample is calculated as follows:

$$R = \sigma\left( \sum_{i=1}^{n} w_i x_i \right)$$

Wherein, $R$ is the disease risk value of the sample, $\sigma$ is the sigmoid function, $x_i$ is the relative content of the i-th VOC of the sample, $w_i$ is the weight corresponding to the i-th VOC, and $n$ is the number of VOCs in the diagnosis marker set.
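Steps S21 and S22 together with the sigmoid conversion may be sketched as follows. The function name, the two weights and the relative contents are hypothetical.

```python
import math


def risk_value(relative_contents, weights):
    """Disease risk value of one sample: sigmoid of the weighted sum."""
    if len(relative_contents) != len(weights):
        raise ValueError("one weight is required per VOC")
    # S21/S22: per-VOC risk assessment scores, accumulated over all VOCs.
    score = sum(w * x for w, x in zip(weights, relative_contents))
    return 1.0 / (1.0 + math.exp(-score))


# Two hypothetical VOCs: one raises the risk, the other lowers it.
weights = [3.0, -1.0]
sample = [0.2, 0.1]
risk = risk_value(sample, weights)  # sigmoid(3.0*0.2 - 1.0*0.1) = sigmoid(0.5)
```

Because of the sigmoid, the risk value always lies between 0 and 1, which is what allows fixed thresholds to be applied to it.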

S23, determining outliers in the disease risk values of the samples.

Wherein the outliers include a malignant lung nodule lower bound outlier and a benign lung nodule upper bound outlier.

Specifically, outliers may be determined in the disease risk values of the respective samples by a DBSCAN clustering algorithm, which may include:

S231, constructing a data point set of the disease risk values of each sample.

Specifically, the data point set may be $D = \{r_1, r_2, \ldots, r_m\}$, and the algorithm parameters may be determined: a preset neighborhood radius $\varepsilon$ and a preset minimum point number MinPts.

S232, calculating the number of data points in a preset neighborhood of the data points according to each data point in the data point set, and determining the data point as a core point if the number of the data points in the preset neighborhood of the data point is not smaller than the preset minimum point number.

S233, determining data points of non-core points in the preset neighborhood of the core point as boundary points for each core point in the data point set.

It will be appreciated that through the process of S232-S233, all points directly or indirectly connected to the core point form a cluster. Both the boundary point and the core point belong to the same cluster.

S234, determining data points which are neither core points nor boundary points in the data point set as outliers, and determining a disease risk value corresponding to each outlier as an outlier.

It will be appreciated that data points within the set of data points that cannot form clusters with the boundary points and the core points are neither core points nor boundary points, and are therefore outliers, so that the risk of illness value corresponding to an outlier can be determined as an outlier.
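Steps S231 to S234 may be sketched for one-dimensional disease risk values as follows. Counting a point as part of its own neighbourhood follows the common DBSCAN convention and is an assumption here; the function name, the parameter values and the risk values are hypothetical.

```python
def dbscan_outliers_1d(values, eps, min_pts):
    """Return the values that are neither core points nor boundary points."""
    n = len(values)
    # Neighbourhood of each point (the point itself is included: assumption).
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    # S232: core points have at least min_pts points in their neighbourhood.
    core = {i for i in range(n) if len(neighbours[i]) >= min_pts}
    # S233: boundary points are non-core points inside a core neighbourhood.
    border = {j for i in core for j in neighbours[i] if j not in core}
    # S234: everything else is an outlier.
    return [values[i] for i in range(n) if i not in core and i not in border]


# Made-up risk values of malignant samples; 0.10 lies far below the rest.
risks = [0.10, 0.80, 0.83, 0.86, 0.88, 0.91, 0.94]
outliers = dbscan_outliers_1d(risks, eps=0.04, min_pts=3)
```

In this toy run 0.80 and 0.94 are boundary points of the main cluster, and only 0.10 is reported as an outlier; for malignant samples such a value would correspond to a lower bound outlier.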

S24, eliminating the lower limit outlier of all malignant lung nodules from the diseased risk values of all samples, and detecting the threshold value with the highest sensitivity of the malignant lung nodules from the diseased risk values of the rest samples as a high risk threshold value.

S25, eliminating all upper limit outliers of benign lung nodules from the disease risk values of all samples, and detecting the threshold with highest specificity of benign lung nodules from the disease risk values of the rest samples as a low risk threshold.

S3, taking the diagnosis marker set as a model risk assessment analysis factor, and constructing a lung nodule malignant risk assessment model based on a weight coefficient, a high risk threshold and a low risk threshold of each VOC in the diagnosis marker set affecting malignant nodule judgment accuracy.

In particular, the lung nodule malignancy risk assessment model may use the high risk threshold and the low risk threshold to divide the malignant lung nodule risk level into three groups: a low risk group below the low risk threshold, a medium risk group between the low risk threshold and the high risk threshold, and a high risk group above the high risk threshold. The threshold setting is illustrated in fig. 4, where different colors represent the risk score probability distributions of the different sample groups (blue is the distribution for malignant nodule samples, and green is that for benign nodule samples); the abscissa represents the lung nodule malignancy risk score, the ordinate represents the relative frequency distribution density of the data points, and the area under each peak is a probability (the total area is always 1). The two red dashed lines are the risk score thresholds determined from this graph: the left red line distinguishes low risk from medium risk samples, and the right red line distinguishes medium risk from high risk samples.

According to the model construction method for lung nodule malignancy risk assessment provided by this embodiment, two risk thresholds are determined for the diagnosis marker set by using a risk scoring method and are applied to lung nodule malignancy risk stratification, which suits different application scenarios. For example, high sensitivity is required in clinical cancer screening to ensure that as many early cancer patients as possible are detected, thereby improving the treatment success rate and the survival rate. For diagnostic tests, high specificity can ensure that only individuals truly carrying the causative gene are diagnosed as positive, avoiding unnecessary anxiety and possible misdiagnosis.

In some embodiments of the present application, the diagnosis marker set mentioned in the previous embodiments is further described. For the exhalation dataset obtained from 218 malignant lung nodule patients and 242 benign lung nodule patients, the resulting diagnosis marker set specifically includes 13 VOCs.

Specifically, the 13 VOCs comprise 2 aromatic compounds, 2 alkane compounds, 3 ketone compounds, 2 aldehyde compounds, 1 olefinic compound and 3 other compounds.

Wherein, the 2 aromatic compounds are selected from a set of aromatic candidate markers, which comprises o-xylene, 1-methylnaphthalene, 3-ethyltoluene, benzene, ethylbenzene, propylbenzene, trimethylbenzene, 1-methyl-3-propylbenzene and p-xylene;

the 2 alkane compounds are selected from a set of alkane candidate markers, which comprises hexane, cyclohexane, 2,4-dimethylheptane, 4-methyloctane, n-dodecane, octane, methylcyclohexane, propylcyclohexane, 2-methylheptane, propane, butane, 2-methylpentane and pentane;

the 3 ketone compounds are selected from a set of ketone candidate markers, which comprises 2-pentanone, 2-butanone, acetone, 2,3-hexanedione and cyclohexanone;

the 2 aldehyde compounds are selected from a set of aldehyde candidate markers, which comprises hexanal, nonanal, heptanal and octanal;

the 1 olefinic compound is selected from a set of olefinic candidate markers, which comprises isoprene, n-heptane, styrene and 1-octene.

Further, training of the lung nodule malignancy risk assessment model is completed on the training set through a risk scoring algorithm, and the performance of the lung nodule malignancy risk assessment model is assessed through the testing set. The AUC of the final lung nodule malignancy risk assessment model reached 91%, sensitivity was 84%, specificity was 73%, ROC curve was as shown in fig. 5, and performance of the lung nodule malignancy risk assessment model was as shown in table 2:

TABLE 2 Performance parameters of lung nodule malignancy risk assessment model

Further, based on the risk assessment model, different thresholds are set and applied to benign/malignant risk stratification of lung nodules. A high-sensitivity threshold of 0.45 can be obtained, which is the threshold with the highest sensitivity after eliminating the lower bound outliers of malignant lung nodules; at this threshold the sensitivity is 97% and the specificity is 62%. Furthermore, a high-specificity threshold of 0.78 can be obtained, which is the threshold with the highest specificity after eliminating the upper bound outliers of benign lung nodules; at this threshold the sensitivity is 57% and the specificity is 98%. Based on these two thresholds, the lung nodule malignancy risk assessment model divides the malignant lung nodule risk level into three groups: below 0.45 is the low risk group, 0.45 to 0.78 is the medium risk group, and above 0.78 is the high risk group.
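The three-group stratification with the two thresholds reported above may be sketched as follows. The function name is hypothetical, and assigning values exactly equal to a threshold to the medium risk group follows the "0.45 to 0.78" wording.

```python
def stratify(risk, low_threshold=0.45, high_threshold=0.78):
    """Map a disease risk value to a risk group."""
    if risk < low_threshold:
        return "low"
    if risk > high_threshold:
        return "high"
    return "medium"  # includes values equal to either threshold


# Made-up risk values of four subjects.
groups = [stratify(r) for r in (0.30, 0.45, 0.78, 0.90)]
```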

By using the lung nodule malignancy risk assessment model, resources can be allocated more reasonably: high risk nodules can be prioritized for further examination and treatment, while over-diagnosis and over-treatment of low risk nodules are avoided. This reduces unnecessary invasive examination and treatment, such as surgery or biopsy, and lowers patient suffering and medical cost. It follows that, compared with traditional clinical assessment methods, the established lung nodule malignancy risk assessment is non-invasive, rapid, and reproducible, and can provide more accurate and personalized diagnostic results, contributing to the early diagnosis and treatment of malignant nodules.

The model construction device for lung nodule malignancy risk assessment is described below. The device described below and the model construction method described above may be referred to in correspondence with each other.

Referring to fig. 6, fig. 6 is a schematic structural diagram of a device for implementing model construction for lung nodule malignancy risk assessment according to an embodiment of the present application.

As shown in fig. 6, the apparatus may include:

an expired air sample acquiring unit 11, configured to acquire expired air of subjects with benign lung nodules and expired air of subjects with malignant lung nodules as a plurality of samples;

a differential marker screening unit 12, configured to perform primary screening for markers of benign and malignant lung nodules based on the expired air of the benign and malignant lung nodule subjects, so as to obtain a differential marker set;

a diagnostic marker secondary screening unit 13, configured to perform secondary screening for markers of benign and malignant lung nodules on the differential marker set by combining a random forest algorithm with a recursive feature elimination with cross-validation algorithm, so as to obtain a diagnostic marker set;

a risk assessment model construction unit 14, configured to construct a lung nodule malignancy risk assessment model using a VOC risk scoring method based on the diagnostic marker set and the plurality of samples.
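As a rough illustration of the secondary screening performed by unit 13, the sketch below wraps a random forest in recursive feature elimination with cross-validation using scikit-learn's `RFECV`. It is not the applicant's implementation: the synthetic data from `make_classification` stands in for the differential marker set, and the number of trees, the number of folds, and the `roc_auc` scoring choice are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the differential marker set (assumption):
# rows are samples, columns are candidate VOC markers, y is benign/malignant.
X, y = make_classification(
    n_samples=120, n_features=10, n_informative=4, n_redundant=0, random_state=0
)

# Random forest whose impurity-based feature importances drive the elimination.
forest = RandomForestClassifier(n_estimators=25, random_state=0)

# Remove one feature per step and keep the subset with the best
# cross-validated score.
selector = RFECV(
    estimator=forest,
    step=1,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="roc_auc",
    min_features_to_select=1,
)
selector.fit(X, y)

# Column indices of the features retained as the "diagnostic marker set".
selected = np.flatnonzero(selector.support_)
print("number of selected features:", selector.n_features_)
print("selected feature indices:", selected.tolist())
```

With real data, `X` would hold the relative content of each differential marker per sample, and the selected columns would form the diagnostic marker set passed on to unit 14.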

Optionally, the diagnostic marker secondary screening unit comprises:

Optionally, the apparatus further comprises:

Optionally, the average reduced non-purity calculating unit includes:

Optionally, the risk assessment model building unit includes:

Optionally, the weight coefficient calculating unit includes:

Optionally, the threshold determining unit includes:

Optionally, the outlier determining unit includes:

Optionally, the apparatus further comprises:

Optionally, the model output unit includes:

The model construction device for lung nodule malignancy risk assessment provided by the embodiment of the application can be applied to model construction equipment for lung nodule malignancy risk assessment, such as a terminal lung nodule benign and malignant diagnosis analysis system. Optionally, fig. 7 shows a block diagram of the hardware structure of the model construction equipment for lung nodule malignancy risk assessment. Referring to fig. 7, the hardware structure of the equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;

In the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete the communication with each other through the communication bus 4;

The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, etc.;

The memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory, such as at least one magnetic disk memory;

wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:

Optionally, for the refined functions and extended functions of the program, reference may be made to the description above.

The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:

Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In the present specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, the embodiments may be combined as needed, and for the same or similar parts the embodiments may be referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of model construction for lung nodule malignancy risk assessment, comprising:

2. The method according to claim 1, wherein the performing secondary screening of the benign and malignant lung nodule markers on the differential marker set by combining a random forest algorithm and a recursive feature elimination with a cross-validation algorithm to obtain a diagnostic marker set comprises:

3. The method of claim 2, further comprising, after computing the performance score of the updated random forest model on the test set of feature sets:

4. The method of claim 2, wherein said calculating an average reduced non-purity for each feature in the random forest model comprises:

5. The method of claim 1, wherein constructing a lung nodule malignancy risk assessment model using VOC risk scoring based on the set of diagnostic markers and the plurality of samples comprises:

6. The method of claim 5, wherein calculating a weight coefficient for each VOC that affects malignancy nodule determination accuracy comprises:

7. The method of claim 5, wherein said determining a high risk threshold and a low risk threshold based on the weight coefficient and relative content of each VOC in the set of diagnostic markers in each sample comprises:

8. The method of claim 7, wherein determining outliers in the disease risk values for each sample comprises:

constructing a data point set of disease risk values of each sample;

9. The method according to any one of claims 1-8, wherein the diagnostic marker set comprises 13 VOCs, in particular 2 aromatics, 2 alkanes, 3 ketones, 2 aldehydes, 1 olefin and 3 other classes of compounds, wherein,

10. The method of any one of claims 1-8, wherein the pulmonary nodule malignancy risk assessment model performs a process of assisting assessment comprising:

Obtaining the expiration of the object to be tested;

11. The method of claim 10, wherein inputting the exhalation of the subject to the lung nodule malignancy risk assessment model, outputting the risk assessment result of the subject, comprises:

12. A model construction apparatus for lung nodule malignancy risk assessment, comprising:

13. A model building device for lung nodule malignancy risk assessment, comprising a memory and a processor;

The memory is used for storing programs;

the processor for executing the program to implement the respective steps of the model construction method for pulmonary nodule malignancy risk assessment according to any one of claims 1-11.

14. A storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the model building method of pulmonary nodule malignancy risk assessment of any one of claims 1-11.