CN113974566A

CN113974566A - COPD acute exacerbation prediction method based on time window

Info

Publication number: CN113974566A
Application number: CN202111319613.8A
Authority: CN
Inventors: 王琨; 朱威; 李强; 陆银美; 侯应伟
Original assignee: Wuxi Qiyi Medical Technology Co ltd
Current assignee: Wuxi Qiyi Medical Technology Co ltd
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2022-01-28
Anticipated expiration: 2041-11-09
Also published as: CN113974566B

Abstract

The invention discloses a COPD acute exacerbation prediction method based on a time window, comprising: S1, collecting lung indexes of a patient twice a day (morning and afternoon) using devices such as a small lung instrument and an electronic stethoscope; S2, using a fixed time window so that the model supports prediction for days T+1, T+2 and T+3 while remaining easy to use; S3, deriving additional features from the original features to reflect changes in the patient's lung monitoring indexes; S4, taking one exacerbation (and the previous 7 days) as a positive sample; S5, performing a significance test on the features; and S6, using the 235 significant features as model inputs to predict whether an exacerbation occurs on day T+d (d = 1, 2, 3). The method uses lung monitoring data from a time window to predict whether the patient is at risk of acute exacerbation of COPD, so that patients can monitor themselves at home, which is significant for the home care of COPD patients.

Description

COPD acute exacerbation prediction method based on time window

Technical Field

The invention relates to the technical field of COPD acute exacerbation prediction, in particular to a COPD acute exacerbation prediction method based on a time window.

Background

Chronic obstructive pulmonary disease (hereinafter referred to as "COPD") is a disease involving chronic bronchitis, emphysema (a condition that damages the alveolar structures), or a mixture of both, together with obstruction of the airways from the bronchi to the alveoli. Its symptoms include long-term cough with sputum, dyspnea due to reduced airflow caused by airway obstruction, and frequent respiratory infections (such as the common cold). The disease causes high mortality worldwide, and its prevalence is increasing rapidly due to smoking, air pollution and the like. The etiology of COPD is an abnormal chronic inflammatory response of the lungs to toxic molecules or gases, with various interacting factors involved, such as smoking, urbanization, pollution and infectious respiratory disease.

Combinations of clinical parameters have been used to predict acute exacerbations of COPD in patients; however, these clinical parameters are not accurate enough for predictions in individual cases. Furthermore, although a COPD patient's likelihood of acute exacerbation can be assessed at a hospital from the above-mentioned factors, patients cannot predict the possibility of their own acute exacerbation; they therefore visit the hospital only after an acute exacerbation has occurred, which may lead to adverse outcomes.

Although literature is available for predicting COPD acute exacerbation events using statistical or machine learning means, the current literature in the field has the following drawbacks:

1. existing studies mainly use cross-sectional data and fail to use time-series data for real-time early warning of COPD acute exacerbation events;

2. current research does not carry out systematic feature mining to improve the predictive capability of the model;

3. current research cannot make separate predictions for days T+1, T+2 and T+3; existing models can only predict the risk probability of a future exacerbation of the patient.

Disclosure of Invention

The invention aims to provide a COPD acute exacerbation prediction method based on a time window, which uses lung monitoring data from a time window to predict whether a patient is at risk of COPD acute exacerbation on day T+d (d = 1, 2, 3). The patient can thereby monitor and receive early warnings at home with simple operation, which is significant for the home care of COPD patients and solves the problems described in the background.

In order to achieve the purpose, the invention provides the following technical scheme:

a COPD acute exacerbation prediction method based on a time window comprises the following steps:

S1, collecting lung indexes of a patient twice a day (morning and afternoon) using devices such as a small lung instrument and an electronic stethoscope, for example FVC, FEV1 and PEF, and the maximum value of the lung vibration energy collected by the stethoscope, wherein: FVC is measured with the small lung instrument and is the forced vital capacity, i.e. the maximum volume of air that can be exhaled as quickly as possible after a maximal inhalation; FEV1 is measured with the small lung instrument and is the volume of air exhaled in the first second of a maximal exhalation after a maximal deep inhalation; PEF is measured with the small lung instrument and is the instantaneous flow rate at the moment expiratory flow is fastest during the forced vital capacity measurement;

S2, to support prediction for days T+1, T+2 and T+3 while keeping the model easy to use, a fixed time window (7 days) of the patient's lung monitoring indexes is used for prediction; 32 indexes are collected from the patient every morning and evening by the electronic devices, and the indexes in the time window are distinguished by date and by whether they were measured in the morning, so the number of features is 32 × 7 × 2 = 448;

S3, deriving additional features from the original features so that the features reflect changes in the patient's lung monitoring indexes; the data expansion comprises: sliding-window statistics of each index, such as the 3-day mean/variance and the 5-day mean/variance; and the differences between indexes on alternate days; after expansion there are 1744 features;

S4, taking one exacerbation (and the previous 7 days) as a positive sample; for negative samples, the specified time window cannot include the 30 days before or after an acute episode, to prevent the disease condition from influencing the monitoring indexes; negative samples are generated by sampling all data in the data set that can be observed continuously for 7 days;

S5, performing a significance test on the features, which finds that 235 features are significantly correlated with exacerbation on day T+d (d = 1, 2, 3);

S6, using the 235 significant features as model inputs to predict whether an exacerbation occurs on day T+d (d = 1, 2, 3); the model is an ensemble model based on decision trees (XGBoost, LightGBM and CatBoost), and 5-fold cross-validation is used to evaluate the model.
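The feature expansion in step S3 can be illustrated with a small sketch. It is not the patent's implementation: the function names, the sample FEV1 readings and the `lag` parameter are hypothetical, and the text does not state whether the "alternate-day" difference uses a lag of one or two days, so the lag is left configurable. The sketch does not attempt to reproduce the 1744-feature count.

```python
from statistics import mean, pvariance

def rolling_stats(values, window):
    """Mean and population variance over every full sliding window of `window` days."""
    out = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        out.append((mean(chunk), pvariance(chunk)))
    return out

def lagged_differences(values, lag=1):
    """Difference between each day's value and the value `lag` days earlier."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]

def expand_features(values, lag=1):
    """Original values, then 3- and 5-day window statistics, then lagged differences."""
    features = list(values)
    for window in (3, 5):
        for m, v in rolling_stats(values, window):
            features.extend([m, v])
    features.extend(lagged_differences(values, lag))
    return features

# Raw feature count stated in step S2: 32 indexes x 7 days x 2 readings per day.
raw_feature_count = 32 * 7 * 2

# Hypothetical morning FEV1 readings (litres) over one 7-day window.
fev1_morning = [1.5, 1.4, 1.6, 1.5, 1.3, 1.2, 1.1]
expanded = expand_features(fev1_morning)
```

For one index over 7 days this yields the 7 original values, 5 three-day windows and 3 five-day windows (two statistics each), and 6 differences.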

Preferably, the XGBoost model interpretation method includes the following steps:

(1) parsing the tree model structure of the XGBoost model to obtain the tree structure of each individual tree;

(2) inputting a test sample into the XGBoost model, and obtaining, from the tree structures, the active leaf node for the test sample in each tree and the active path leading to that leaf node;

(3) calculating the contribution value of each feature along the active path, and interpreting the XGBoost model according to the obtained contribution values.
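The three interpretation steps can be sketched on a toy tree. This is an illustration under stated assumptions rather than the patent's procedure: the nested-dict tree format, the per-node `value` field (the expected output at that node), the feature names and the rule "go left when the feature is below the threshold" are all hypothetical, and the attribution rule used here (credit each split feature with the change in node value along the active path) is one common path-based scheme.

```python
def path_contributions(tree, sample):
    """Follow the active path for `sample`; credit each split feature with
    the change in node value between the split node and the chosen child."""
    contributions = {}
    node = tree
    while "feature" in node:  # leaves carry only a "value"
        name = node["feature"]
        branch = "left" if sample[name] < node["threshold"] else "right"
        child = node[branch]
        contributions[name] = contributions.get(name, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], contributions

def explain(trees, sample):
    """Sum root values (bias) and per-feature contributions over all trees."""
    bias, total = 0.0, {}
    for tree in trees:
        bias += tree["value"]
        _, contrib = path_contributions(tree, sample)
        for name, delta in contrib.items():
            total[name] = total.get(name, 0.0) + delta
    return bias, total

# Hypothetical single tree with made-up node values and feature names.
toy_tree = {
    "feature": "fev1_mean_3d", "threshold": 1.3, "value": 0.0,
    "left": {
        "feature": "lung_sound_energy_max", "threshold": 50.0, "value": 0.5,
        "left": {"value": 0.25},
        "right": {"value": 1.0},
    },
    "right": {"value": -0.5},
}
sample = {"fev1_mean_3d": 1.2, "lung_sound_energy_max": 62.0}
leaf_value, contrib = path_contributions(toy_tree, sample)
bias, total = explain([toy_tree], sample)
```

By construction, the bias plus the sum of the contributions equals the leaf value reached by the sample, so the contributions decompose the tree's output feature by feature.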

Preferably, XGBoost uses the boosting ensemble method and is widely used in data mining; it can handle missing values, supports regularization, and uses a second-order approximation of the cost function to accelerate optimization.

Preferably, LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms and is a complete solution for distributed training based on the DMTK framework.

Preferably, the CatBoost algorithm includes: in a sensing period, the secondary user sends the sensed energy value in the channel to the fusion center as a feature energy vector, and the primary user intermittently sends information on whether the spectrum resources are occupied to the fusion center as a label, so that the construction of a training data set is completed. The model is trained with the CatBoost algorithm in the fusion center.

Preferably, the CatBoost algorithm, proposed by Yandex, optimizes the handling of categorical features by processing them in the training stage rather than the data preprocessing stage, and calculates leaf node values when selecting the tree structure so as to reduce overfitting.

Preferably, the prediction period is eight days long, i.e. positive samples are extracted using an eight-day time window (T-7, T-6, T-5, T-4, T-3, T-2, T-1 and T); for a positive sample, day T is the onset date of the acute exacerbation; for negative samples, the specified time window cannot include the 7 days before or after an acute episode.
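A minimal sketch of this sampling rule follows. It assumes days are numbered as integers and that a day counts as "observed" when it appears in `observed_days`; both helper names are hypothetical. Because the text gives two different exclusion margins (30 days in step S4 and 7 days here), the margin is a parameter.

```python
def positive_window(onset_day, length=8):
    """Days T-7 .. T for an exacerbation whose onset is day T."""
    return list(range(onset_day - length + 1, onset_day + 1))

def negative_windows(observed_days, onset_days, length=8, exclusion=7):
    """Every fully observed window that stays more than `exclusion` days
    away from all exacerbation onsets."""
    observed = set(observed_days)
    windows = []
    for end in sorted(observed):
        days = list(range(end - length + 1, end + 1))
        if not all(d in observed for d in days):
            continue  # the window must be observed on consecutive days
        if any(abs(d - t) <= exclusion for d in days for t in onset_days):
            continue  # too close to an acute episode
        windows.append(days)
    return windows

# Hypothetical patient: observed on days 0..39, one exacerbation starting on day 30.
pos = positive_window(30)
negs = negative_windows(range(40), onset_days=[30])
```

With these inputs the positive window covers days 23 to 30, and negative windows may only use days 0 to 22.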

Preferably, in order to achieve early warning, 3 groups of prediction tasks with different lead times are set within the prediction period:

(1) Task_1, using observations from day T-5 to day T-1 to predict whether an acute exacerbation occurs on day T;

(2) Task_2, using observations from day T-6 to day T-2 to predict whether an acute exacerbation occurs on day T;

(3) Task_3, using observations from day T-7 to day T-3 to predict whether an acute exacerbation occurs on day T.
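The three task definitions differ only in how far the five observed days are shifted back from day T. The sketch below expresses that as day offsets relative to T; the dictionary layout and function names are illustrative assumptions.

```python
# Lead time in days between the last observed day and the predicted day T.
TASK_LEAD_DAYS = {"Task_1": 1, "Task_2": 2, "Task_3": 3}
OBSERVED_DAYS = 5  # each task uses five consecutive days of observations

def task_offsets(task):
    """Day offsets relative to T (0 is day T) that the task may use as input."""
    lead = TASK_LEAD_DAYS[task]
    return list(range(-(lead + OBSERVED_DAYS - 1), -lead + 1))

def task_inputs(window, task):
    """`window` maps offsets -7..0 to that day's observations; return the task's slice."""
    return [window[offset] for offset in task_offsets(task)]

# Hypothetical eight-day window keyed by offset from T.
window = {offset: f"day{offset}" for offset in range(-7, 1)}
task3_inputs = task_inputs(window, "Task_3")
```

All three tasks fit inside the same eight-day window (T-7 to T), so one extracted window can serve every task.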

Preferably, to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are the same, so for each feature its distribution on the positive samples is tested against its distribution on the negative samples, with a significance level of 0.05.
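The screening step can be sketched with a pure-Python two-sample Kolmogorov-Smirnov test. The statistic is the largest gap between the two empirical distribution functions; the p-value here uses the standard asymptotic series approximation, which is an assumption of this sketch (the patent does not say how the p-value is computed, and in practice a library routine such as `scipy.stats.ks_2samp` would normally be used). Feature names and data are made up.

```python
from bisect import bisect_right
from math import exp, sqrt

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def ks_pvalue(d, n1, n2):
    """Asymptotic p-value of the two-sample KS statistic (series approximation)."""
    n_eff = n1 * n2 / (n1 + n2)
    lam = (sqrt(n_eff) + 0.12 + 0.11 / sqrt(n_eff)) * d
    if lam < 0.2:
        return 1.0  # the series is numerically 1 in this range
    total = sum((-1) ** (j - 1) * exp(-2.0 * j * j * lam * lam) for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))

def select_significant(pos, neg, alpha=0.05):
    """Keep features whose positive and negative distributions differ at level alpha.
    `pos` and `neg` map feature name -> list of values."""
    selected = []
    for name in pos:
        d = ks_statistic(pos[name], neg[name])
        if ks_pvalue(d, len(pos[name]), len(neg[name])) < alpha:
            selected.append(name)
    return selected

# Hypothetical data: one feature is clearly shifted, the other is identical.
pos = {"shifted": list(range(100, 120)), "same": list(range(20))}
neg = {"shifted": list(range(20)), "same": list(range(20))}
kept = select_significant(pos, neg)
```

Testing many features at a fixed level of 0.05 without correction will admit some features by chance; whether the patent applies any correction is not stated.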

In summary, due to the adoption of the technology, the invention has the beneficial effects that:

1. the invention can predict and give early warning of whether a COPD acute exacerbation will occur on day T+d (d = 1, 2, 3), using lung monitoring data from a time window to predict whether the patient is at risk of COPD acute exacerbation;

2. compared with using only the original measured index values, the data construction method of the invention, including the selection of positive and negative samples and the combination of data sampling with medical knowledge, significantly improves the model performance;

3. the model of the invention is highly practical: the patient can monitor and receive early warnings at home with simple operation, which is of great significance for the home care of COPD patients.

Drawings

FIG. 1 is a flow chart of model construction according to the present invention;

FIG. 2 shows the ROC curves of five LightGBM-based models under the Task_1 setting of the present invention;

FIG. 3 shows the ROC curves of five LightGBM-based models under the Task_2 setting of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. The following detailed description of the embodiments, as presented in the figures, is therefore not intended to limit the scope of the invention as claimed, but merely represents selected embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention;

the invention provides a COPD acute exacerbation prediction method based on a time window as shown in figures 1-3, which comprises the following steps:

S1, collecting lung indexes of a patient twice a day (morning and afternoon) using devices such as a small lung instrument and an electronic stethoscope, for example FVC, FEV1 and PEF, and the maximum value of the lung vibration energy collected by the stethoscope, wherein: FVC is measured with the small lung instrument and is the forced vital capacity, i.e. the maximum volume of air that can be exhaled as quickly as possible after a maximal inhalation; FEV1 is measured with the small lung instrument and is the volume of air exhaled in the first second of a maximal exhalation after a maximal deep inhalation; PEF is measured with the small lung instrument and is the instantaneous flow rate at the moment expiratory flow is fastest during the forced vital capacity measurement (the lung indexes are shown in Table 1);

Specifically, the XGboost model interpretation method comprises the following steps:

Specifically, XGBoost uses the boosting ensemble method and is widely used in data mining; it can handle missing values, supports regularization, and uses a second-order approximation of the cost function to accelerate optimization.

Specifically, LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms. Owing to its fully greedy tree growth method and its histogram-based memory and computation optimizations, it is several times faster than existing gradient boosting tree implementations. It is a complete solution for distributed training based on the DMTK framework, and it quickly became a common tool among data mining competitors after its release.

Specifically, the CatBoost algorithm includes: in a sensing period, the secondary user sends the sensed energy value in the channel to the fusion center as a feature energy vector, and the primary user intermittently sends information on whether the spectrum resources are occupied to the fusion center as a label, so that the construction of a training data set is completed. The model is trained with the CatBoost algorithm in the fusion center.

Specifically, the CatBoost algorithm, proposed by Yandex, optimizes the handling of categorical features by processing them in the training stage rather than the data preprocessing stage, and calculates leaf node values when selecting the tree structure, thereby reducing overfitting.

Specifically, for the prediction period, positive samples are extracted using an eight-day time window, with the eight days denoted (T-7, T-6, T-5, T-4, T-3, T-2, T-1, T); for a positive sample, day T is the onset date of the acute exacerbation; for negative samples, the specified time window cannot include the 7 days before or after an acute episode.

Specifically, in order to achieve the early warning effect in the prediction period, 3 groups of prediction tasks are set in advance:

Specifically, in order to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are the same, so for each feature its distribution on the positive samples is tested against its distribution on the negative samples, with a significance level of 0.05.

Table 1. Names of the observed features and their interpretation;

Table 2. The top fifty features ranked by the significance test, with their p-values;

Using stratified k-fold cross-validation (k = 5), the data are divided into 5 folds, each time split 8:2 into a training set and a test set on which the model is trained and tested. The evaluation indexes are sensitivity, specificity and AUC, where the threshold is the minimum threshold that allows sensitivity to exceed 0.9 and the specificity is the specificity at that threshold. The models used are CatBoost, XGBoost and LightGBM, and the other hyper-parameters are obtained by hyper-parameter search with cross-validation. Three tasks are set: Task_1, Task_2 and Task_3; under each task, the following models are set:

(1) M_all, trained with all features;

(2) M_sig, using all features that pass the significance test;

(3) M_sigSTE, using the electronic stethoscope features that pass the significance test;

(4) M_sigLSI, using the small lung instrument features that pass the significance test;

(5) M_sig50, using the 50 features with the lowest p-values among those passing the significance test under the task setting;

(6) M_sig25, using the top 25 features that pass the significance test;

(7) M_orig, trained with all original observed indexes.

Model	Task_1	Task_2	Task_3
M_all	0.8135	0.8135	0.8135
M_sig	0.9268	0.9045	0.8887
M_sigSTE	0.9020	0.8845	0.8302
M_sigLSI	0.8279	0.7158	0.6617
M_sig50	0.8826	0.8000	0.8631
M_sig25	0.8173	0.8075	0.8816
M_orig	0.7361	0.7434	0.5782

Table 3. Mean AUC of cross-validation;
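The evaluation protocol (stratified 5-fold splitting, then AUC per fold) can be sketched without any modelling library. The fold assignment and the pairwise AUC below are generic textbook constructions, not the patent's code; the label vector is hypothetical and the model-fitting step is left out.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data set: 10 positive and 40 negative windows.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels)
splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))  # a model would be fit on train_idx here
```

Each of the five splits trains on 40 samples and tests on 10, which matches the 8:2 ratio described above; the reported score would be the mean of the per-fold AUC values.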

Under the Task_1 setting, there are 123 significant features, of which 31 are small lung instrument features and 92 are electronic stethoscope features.

Under the Task_2 setting, there are 134 significant features, of which 33 are small lung instrument features and 101 are electronic stethoscope features. Under the Task_3 setting, there are 131 significant features, of which 28 are small lung instrument features and 103 are electronic stethoscope features.

Table 3 reports the mean AUC of the cross-validation, where the model used is LightGBM. (1) Task_1 obtains a higher score, which agrees with intuition (predicting the next day is easier than predicting two or three days ahead);

(2) using only the features generated by the small lung instrument lowers the score markedly, whereas the features generated by the stethoscope still perform well, which indicates that the observations of the electronic stethoscope have stronger discriminative and predictive power;

(3) screening the features with the significance test gives a clear improvement over directly using the original observations or all features;

(4) with only the top 50 or top 25 significant features, the model score decreases somewhat, indicating that the fitting ability of the model declines when the number of features is reduced. The ROC curves of the five LightGBM-based models under the Task_1 setting are shown in FIG. 2;

To verify the performance under other models, the results under the XGBoost and CatBoost models are given below:

Model	Task_1	Task_2	Task_3
M_all	0.8772	0.8673	0.8142
M_sig	0.9181	0.8946	0.8233
M_sigSTE	0.9036	0.8792	0.8110
M_sigLSI	0.8279	0.7610	0.7000
M_sig50	0.8372	0.7831	0.8184
M_sig25	0.8177	0.8047	0.8203
M_orig	0.7812	0.7881	0.6659

Table 3-1. Mean AUC of cross-validation. The model used is XGBoost.

Table 3-2. Mean AUC of cross-validation. The model used is CatBoost.

Setting	Sensitivity	Specificity	Probability threshold
Task_1 M_sig50	0.9043	0.7345	0.0113
Task_2 M_sig50	0.9043	0.7098	0.0091
Task_3 M_sig50	0.9043	0.6623	0.0042

Table 4. Sensitivity and specificity of the optimal model M_sig under each task setting.
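The operating points in Table 4 come from fixing a sensitivity target and reading off the specificity at the chosen probability threshold. The sketch below shows one way to do that; it interprets the rule as "the highest threshold at which sensitivity still reaches the target", which is an assumption about the intended meaning, and the scores are invented for illustration.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Classify as positive when score >= threshold, then compute both rates."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def pick_threshold(labels, scores, target_sensitivity=0.9):
    """Scan candidate thresholds from high to low and stop at the first one
    whose sensitivity reaches the target."""
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        if sens >= target_sensitivity:
            return threshold, sens, spec
    raise ValueError("no threshold reaches the target sensitivity")

# Hypothetical predicted probabilities for 10 positive and 10 negative windows.
labels = [1] * 10 + [0] * 10
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05,
          0.45, 0.35, 0.25, 0.15, 0.04, 0.03, 0.02, 0.01, 0.005, 0.001]
threshold, sens, spec = pick_threshold(labels, scores)
```

Low thresholds such as those in Table 4 are what one would expect when exacerbations are rare and the model's predicted probabilities are correspondingly small.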

To verify the influence of different decision-tree models on the prediction tasks, the following table reports model performance with the significant features under the three task settings. We compare XGBoost, LightGBM and CatBoost, three of the strongest gradient boosting algorithms based on decision trees; according to the experimental results, LightGBM achieves the highest mean AUC under all three task settings.

Setting	LightGBM	CatBoost	XGBoost
Task_1 M_sig	0.9268	0.8852	0.9181
Task_2 M_sig	0.9045	0.8505	0.8946
Task_3 M_sig	0.8887	0.8722	0.8233

Table 5. Cross-validation mean AUC of the three decision-tree ensemble models CatBoost, LightGBM and XGBoost, based on the best feature combination M_sig under each task setting.

Example 2

Five-fold cross-validation was performed on Task_1 using the LightGBM model; the AUC of each fold is as follows:

Five-fold cross-validation was performed on Task_2 using the LightGBM model; the AUC of each fold is as follows:

Five-fold cross-validation was performed on Task_3 using the LightGBM model; the AUC of each fold is as follows:

Example 3

Five-fold cross-validation was performed on Task_1 using the LightGBM model; the mean ACC, precision, recall, F1 and AUC scores are as follows:

Table 5. Mean scores of the various indexes from cross-validation on Task_1.

Five-fold cross-validation was performed on Task_2 using the LightGBM model; the mean ACC, precision, recall, F1 and AUC scores are as follows:

Model	AUC	ACC	Precision	Recall	F1
M_all	0.8135	0.8694	0.5415	0.7142	0.5475
M_sig	0.9045	0.8665	0.5365	0.9142	0.6441
M_sigSTE	0.8845	0.9108	0.6758	0.5714	0.6112
M_sigLSI	0.7158	0.7509	0.2969	0.7428	0.4028
M_sig50	0.8000	0.8807	0.4988	0.6243	0.5275
M_sig25	0.8075	0.8950	0.4902	0.6571	0.5322
M_orig	0.7434	0.8423	0.7107	0.6	0.5572

Table 6. Mean scores of the various indexes from cross-validation on Task_2.

Five-fold cross-validation was performed on Task_3 using the LightGBM model; the mean ACC, precision, recall, F1 and AUC scores are as follows:

Table 7. Mean scores of the various indexes from cross-validation on Task_3.

The above description covers only the preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the protection scope of the present invention.

It is to be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions; also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Claims

1. A COPD acute exacerbation prediction method based on a time window is characterized by comprising the following steps:

S3, deriving additional features from the original features so that the features reflect changes in the patient's lung monitoring indexes; the data expansion comprises: sliding-window statistics of each index; and the differences between indexes on alternate days;

2. The method of claim 1, wherein the method comprises: the XGboost model interpretation method comprises the following steps:

3. The method of claim 1, wherein: XGBoost uses the boosting ensemble method and is widely used in data mining; it can handle missing values, supports regularization, and uses a second-order approximation of the cost function to accelerate optimization.

4. The method of claim 1, wherein: LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms and is a complete solution for distributed training based on the DMTK framework.

5. The method of claim 1, wherein the method comprises: the Catboost algorithm includes: in a sensing period, the secondary user sends the sensed energy value in the channel to the fusion center as a characteristic energy vector, and the primary user sends information of whether the spectrum resources are occupied or not to the fusion center as a label discontinuously, so that the construction of a training data set is completed. The model was trained with the Catboost algorithm in the fusion center.

6. The method of claim 1, wherein: the CatBoost algorithm, proposed by Yandex, optimizes the handling of categorical features by processing them in the training stage rather than the data preprocessing stage, and calculates leaf node values when selecting the tree structure so as to reduce overfitting.

7. The method of claim 1, wherein: for the prediction period, positive samples are extracted using an eight-day time window, with the eight days denoted (T-7, T-6, T-5, T-4, T-3, T-2, T-1, T); for a positive sample, day T is the onset date of the acute exacerbation; for negative samples, the specified time window cannot include the 7 days before or after an acute episode.

8. The method of claim 7, wherein the method comprises: in order to achieve the effect of early warning in the prediction period, 3 groups of prediction tasks are set in advance:

9. The method of claim 1, wherein: in order to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are the same, so for each feature its distribution on the positive samples is tested against its distribution on the negative samples, with a significance level of 0.05.