CN113974566B

CN113974566B - COPD acute exacerbation prediction method based on time window

Info

Publication number: CN113974566B
Application number: CN202111319613.8A
Authority: CN
Inventors: 王琨; 朱威; 李强; 陆银美; 侯应伟
Original assignee: Wuxi Qiyi Medical Technology Co ltd
Current assignee: Wuxi Qiyi Medical Technology Co ltd
Priority date: 2021-11-09
Filing date: 2021-11-09
Publication date: 2023-09-19
Anticipated expiration: 2041-11-09
Also published as: CN113974566A

Abstract

The invention discloses a time-window-based method for predicting acute exacerbation of COPD, comprising the following steps: S1, collecting the patient's lung indices twice daily (morning and afternoon) using a small lung instrument, an electronic stethoscope and other equipment; S2, enabling the model to support prediction for days T+1, T+2 and T+3 while keeping the model usable; S3, extracting further features from these features to reflect changes in the patient's lung monitoring indices; S4, taking each exacerbation together with the preceding 7 days as a positive sample; S5, performing a significance test on the features; S6, using the 235 significant features as model inputs to predict whether the patient's condition is exacerbated on day T+d (d = 1, 2, 3). The method uses lung monitoring data from a time window to predict whether the patient is at risk of an acute exacerbation of COPD, so that patients can monitor themselves at home, which is of great significance for the home care of COPD patients.

Description

COPD acute exacerbation prediction method based on time window

Technical Field

The invention relates to the technical field of COPD acute exacerbation prediction, in particular to a time window-based COPD acute exacerbation prediction method.

Background

Chronic obstructive pulmonary disease (hereinafter referred to as "COPD") is a disease involving chronic bronchitis, emphysema (a disease that damages the alveolar structure), or both, in which the airways from the bronchi to the alveoli are obstructed. Its symptoms include long-term cough with sputum, respiratory distress due to reduced airflow caused by airway obstruction, and frequent respiratory tract infections (such as the common cold). The disease causes high mortality worldwide and is increasing rapidly owing to smoking, air pollution and similar factors. The etiology of COPD is an abnormal chronic inflammatory response of the lung to toxic molecules or gases, with various factors (such as smoking, urbanization, pollution and infectious respiratory disease) involved in a complex manner.

Combinations of clinical parameters have been used to predict acute exacerbations of COPD; however, these parameters are not adequate for accurate prediction in individual cases. Furthermore, while the likelihood of an acute exacerbation can be assessed once a COPD patient goes to the hospital, patients cannot predict their own likelihood of an acute exacerbation. As a result, a patient who goes to the hospital only after an acute exacerbation has occurred may have a poor outcome.

Although the literature already contains work that predicts COPD acute exacerbation events by statistical or machine learning means, that work has the following drawbacks:

1. existing research relies mainly on cross-sectional data and cannot use time-series data to give real-time early warning of a patient's COPD acute exacerbation events;

2. current research does not perform systematic feature mining to improve the predictive capability of the model;

3. current studies cannot make predictions for T+1, T+2 and T+3; existing models can only predict the risk probability of a future exacerbation.

Disclosure of Invention

The invention aims to provide a time-window-based COPD acute exacerbation prediction method that uses lung monitoring data from a time window to predict whether a patient is at risk of an acute exacerbation of COPD on day T+d (d = 1, 2, 3). Patients can monitor themselves and receive early warning at home, and the method is simple to operate, which is of great significance for the home care of COPD patients and solves the problems described in the background.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a method for predicting acute exacerbation of COPD based on a time window, comprising the steps of:

S1, collecting the patient's lung indices twice daily (morning and afternoon) using a small lung instrument, an electronic stethoscope and other devices, such as FVC, FEV1, PEF and the maximum energy value of lung vibration collected by the stethoscope, wherein FVC is measured with the small lung instrument and is the forced vital capacity, namely the maximum volume of air that can be exhaled as quickly as possible after a maximal inhalation; FEV1 is measured with the small lung instrument and is the volume of air exhaled in the first second of a maximal exhalation following a maximal deep inhalation; PEF is measured with the small lung instrument and is the instantaneous flow rate at the point of fastest expiratory flow during the forced vital capacity measurement;

S2, so that the model supports prediction for days T+1, T+2 and T+3 while remaining usable, prediction uses the patient's lung monitoring indices over a fixed time window (7 days); 32 indices are collected from the patient every morning and evening by electronic equipment, and the indices within the 7-day time window are distinguished by date and by whether they were taken in the morning, giving 32 × 7 × 2 = 448 features;

S3, extracting further features from these features to reflect changes in the patient's lung monitoring indices; the data expansion includes sliding-window statistics of the indices, such as 3-day mean/variance and 5-day mean/variance, and day-to-day differences of the indices, giving 1744 expanded features;

S4, taking each exacerbation together with the preceding 7 days as a positive sample; for negative samples, the time window must not fall within the 30 days before or after an acute exacerbation, so that the exacerbation does not affect the monitoring indices; negative samples are generated by sampling from all stretches of data in which 7 consecutive days of observations are available;

S5, performing a significance test on the features, which finds 235 features that are significantly correlated with whether an exacerbation occurs on day T+d (d = 1, 2, 3);

S6, using the 235 significant features as model inputs to predict whether an exacerbation occurs on day T+d (d = 1, 2, 3); the model is a decision-tree-based ensemble model (XGBoost, LightGBM and CatBoost), and model performance is evaluated using 5-fold cross-validation.
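As an illustration only, the feature layout of S2 and the data expansion of S3 can be sketched in Python as follows. The column naming scheme, the session labels, the labelling of the window days as T-7 to T-1, the use of population variance and the sample values are all assumptions made for this sketch and are not taken from the method itself; the sketch does not attempt to reproduce the count of 1744 expanded features.

```python
from statistics import mean, pvariance

N_INDICES = 32           # indices per measurement session (from S2)
N_DAYS = 7               # fixed time window in days (from S2)
SESSIONS = ("am", "pm")  # morning / evening; these labels are hypothetical


def raw_feature_names():
    """One name per (index, day, session) triple: 32 x 7 x 2 names."""
    return [
        f"idx{i:02d}_T-{d}_{s}"
        for i in range(N_INDICES)
        for d in range(N_DAYS, 0, -1)   # assumed day labels T-7 ... T-1
        for s in SESSIONS
    ]


def rolling_stats(values, width):
    """(mean, population variance) for every run of `width` consecutive days."""
    return [
        (mean(values[i:i + width]), pvariance(values[i:i + width]))
        for i in range(len(values) - width + 1)
    ]


def day_to_day_diff(values):
    """Difference between each day's value and the previous day's value."""
    return [b - a for a, b in zip(values, values[1:])]


print(len(raw_feature_names()))    # 448

fev1_am = [1.9, 1.8, 1.8, 1.7, 1.6, 1.6, 1.4]   # made-up morning FEV1 values
print(rolling_stats(fev1_am, 3))   # five 3-day (mean, variance) pairs
print(rolling_stats(fev1_am, 5))   # three 5-day (mean, variance) pairs
print(day_to_day_diff(fev1_am))    # six day-to-day differences
```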

Preferably, the interpretation method of the XGBoost model comprises the following steps:

(1) Parsing the tree model structure of the XGBoost model to obtain the tree structure of each individual tree;

(2) Inputting a test sample to the XGBoost model, and obtaining, according to the tree structure, the effective leaf node corresponding to the test sample and the effective path in the tree leading to that leaf node;

(3) Calculating the contribution value of each feature along the effective path, and interpreting the XGBoost model according to the contribution values obtained.
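The text does not give the formula used to turn an effective path into contribution values. One common reading, shown here purely as a hedged sketch, credits each change in node value along the path to the feature that was split on at that node. The dictionary layout of the toy tree, the node values and the feature names are hypothetical; for a real ensemble the per-tree contributions would be summed over all trees.

```python
def path_contributions(tree, sample):
    """Walk one tree and credit each value change to the split feature.

    Returns the value of the leaf reached and a dict feature -> contribution.
    """
    contrib = {}
    node = tree
    while "feature" in node:                       # internal node
        feature = node["feature"]
        branch = "left" if sample[feature] < node["threshold"] else "right"
        child = node[branch]
        contrib[feature] = contrib.get(feature, 0.0) + child["value"] - node["value"]
        node = child
    return node["value"], contrib


# Hypothetical single tree; "value" is the expected output at each node.
toy_tree = {
    "value": 0.0, "feature": "fev1_mean_3d", "threshold": 1.5,
    "left": {
        "value": 0.4, "feature": "sound_energy_max", "threshold": 10.0,
        "left": {"value": 0.1},
        "right": {"value": 0.9},
    },
    "right": {"value": -0.3},
}

sample = {"fev1_mean_3d": 1.2, "sound_energy_max": 12.0}   # made-up test sample
leaf_value, contributions = path_contributions(toy_tree, sample)
print(leaf_value, contributions)
```

By construction, the root value plus the sum of the contributions equals the value of the leaf that the sample reaches.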

Preferably, XGBoost uses a Boosting ensemble method, is widely used in data mining, handles missing values and applies regularization, and uses second-order information to accelerate optimization of the cost function.

Preferably, LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms, and is a complete solution for distributed training based on the DMTK framework.

Preferably, the Catboost algorithm includes: in the sensing period, the secondary user sends the energy value in the sensed channel to the fusion center as a characteristic energy vector, and the primary user intermittently sends information of occupying the frequency spectrum resource to the fusion center as a label, so that the construction of the training data set is completed. The model is trained in the fusion center using the Catboost algorithm.

Preferably, the CatBoost algorithm was proposed by Yandex; it optimizes the handling of categorical features, which is done during the training phase rather than in data preprocessing, and computes leaf node values when selecting the tree structure, which reduces overfitting.

Preferably, the prediction period takes eight days as a time window, denoted (T-7, T-6, T-5, T-4, T-3, T-2, T-1, T); for positive samples, day T is the onset date of the acute exacerbation; for negative samples, the time window must not fall within the 7 days before or after an acute exacerbation.
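A minimal sketch of how candidate negative windows might be enumerated is given below. It assumes that days are represented as integer day numbers and that "must not fall within the margin" means no day of the window lies within `margin` days of any onset date; the function name and parameters are hypothetical. The defaults follow this paragraph (8-day window, 7-day margin); step S4 states a 30-day margin, which can be passed instead.

```python
def negative_window_starts(observed_days, onset_days, window=8, margin=7):
    """Start days of fully observed windows that stay clear of every onset.

    observed_days: iterable of integer day numbers with observations
    onset_days:    integer day numbers on which an acute exacerbation began
    """
    observed = set(observed_days)
    starts = []
    for start in sorted(observed):
        days = range(start, start + window)
        if not all(d in observed for d in days):
            continue                      # window has a missing day
        if any(abs(d - onset) <= margin for d in days for onset in onset_days):
            continue                      # window too close to an exacerbation
        starts.append(start)
    return starts


# Made-up record: 60 consecutive observed days, one onset on day 30.
starts = negative_window_starts(range(60), [30])
print(starts)
```

Negative samples would then be drawn from these candidate windows, for example by random sampling.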

Preferably, in order to give early warning, 3 groups of prediction tasks are set within the prediction period:

(1) Task_1, using the observations from day T-5 to day T-1 to predict whether an acute exacerbation occurs on day T;

(2) Task_2, using the observations from day T-6 to day T-2 to predict whether an acute exacerbation occurs on day T;

(3) Task_3, using the observations from day T-7 to day T-3 to predict whether an acute exacerbation occurs on day T.
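The three tasks differ only in which five days of the window feed the model. A small sketch, assuming integer day numbers and a hypothetical helper name, makes the offsets explicit:

```python
# (first offset, last offset) before day T, taken from the task definitions.
TASK_OFFSETS = {
    "Task_1": (5, 1),   # observations from T-5 to T-1
    "Task_2": (6, 2),   # observations from T-6 to T-2
    "Task_3": (7, 3),   # observations from T-7 to T-3
}


def task_days(task, t):
    """Day numbers whose observations are used to predict day `t`."""
    first, last = TASK_OFFSETS[task]
    return list(range(t - first, t - last + 1))


for name in TASK_OFFSETS:
    print(name, task_days(name, 10))
```

Each task therefore uses five days of observations, with the gap between the last observed day and day T growing from one day (Task_1) to three days (Task_3).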

Preferably, in order to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are identical, and for each feature it is applied to the feature's distribution over the positive samples and its distribution over the negative samples, with a significance level of 0.05.
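For illustration, a two-sample Kolmogorov-Smirnov screen can be sketched with the standard library alone. The p-value below uses the usual asymptotic series for the Kolmogorov distribution with a small-sample correction, which is an assumption of this sketch; the text does not state how the p-value is computed, and a statistics library would normally be used in practice. The sample values are made up.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value for statistic d with sample sizes n, m."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:                 # the survival function is ~1 in this range
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Made-up values of one feature on positive and negative samples.
positive = [1.1, 1.2, 1.3, 1.2, 1.0, 1.4, 1.1, 1.3]
negative = [1.8, 1.9, 2.1, 2.0, 1.7, 2.2, 1.9, 2.0, 1.8, 2.1]

d = ks_statistic(positive, negative)
p = ks_pvalue(d, len(positive), len(negative))
keep_feature = p < 0.05           # significance level from the text
print(d, p, keep_feature)
```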

In summary, the beneficial effects of the invention are as follows due to the adoption of the technology:

1. the method can predict and give early warning of whether an acute exacerbation of COPD will occur on day T+d (d = 1, 2, 3), using lung monitoring data from a time window;

2. the invention uses feature engineering, and in the data construction method the selection of positive and negative samples combines data sampling with medical knowledge, so that the model performance is markedly improved;

3. the model is highly practical: patients can monitor themselves and receive early warning at home, and operation is simple, which is of great significance for the home care of COPD patients.

Drawings

FIG. 1 is a flow chart of the model construction of the present invention;

FIG. 2 shows the ROC curves of the five LightGBM-based models under the Task_1 setting of the present invention;

FIG. 3 shows the ROC curves of the five LightGBM-based models under the Task_2 setting of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. The following detailed description of the embodiments provided in the accompanying drawings is therefore not intended to limit the scope of the invention as claimed, but merely represents selected embodiments; all other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.

The invention provides a time-window-based method for predicting acute exacerbation of COPD, which, as shown in FIGS. 1 to 3, comprises the following steps:

S1, collecting the patient's lung indices twice daily (morning and afternoon) using a small lung instrument, an electronic stethoscope and other devices, such as FVC, FEV1, PEF and the maximum energy value of lung vibration collected by the stethoscope, wherein FVC is measured with the small lung instrument and is the forced vital capacity, namely the maximum volume of air that can be exhaled as quickly as possible after a maximal inhalation; FEV1 is measured with the small lung instrument and is the volume of air exhaled in the first second of a maximal exhalation following a maximal deep inhalation; PEF is measured with the small lung instrument and is the instantaneous flow rate at the point of fastest expiratory flow during the forced vital capacity measurement (the lung indices are shown in Table 1);

Specifically, the interpretation method of the XGBoost model comprises the following steps:

Specifically, XGBoost uses a Boosting ensemble method, is widely used in data mining, handles missing values and applies regularization, and uses second-order information to accelerate optimization of the cost function.

Specifically, LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms. Owing to its fully greedy tree growth method and its histogram-based memory and computation optimizations, it is several times faster than existing gradient boosting tree implementations. It is a complete solution for distributed training based on the DMTK framework, and it quickly became a common tool for data mining contestants after its release.

Specifically, the Catboost algorithm includes: in the sensing period, the secondary user sends the energy value in the sensed channel to the fusion center as a characteristic energy vector, and the primary user intermittently sends information of occupying the frequency spectrum resource to the fusion center as a label, so that the construction of the training data set is completed. The model is trained in the fusion center using the Catboost algorithm.

Specifically, the CatBoost algorithm was proposed by Yandex; it optimizes the handling of categorical features, which is done during the training phase rather than in the data preprocessing phase, and computes leaf node values when selecting the tree structure, which reduces overfitting.

Specifically, the prediction period takes eight days as a time window for extracting positive samples, denoted (T-7, T-6, T-5, T-4, T-3, T-2, T-1, T); for positive samples, day T is the onset date of the acute exacerbation; for negative samples, the time window must not fall within the 7 days before or after an acute exacerbation.

Specifically, in order to achieve the effect of early warning in the prediction period, 3 groups of prediction tasks are set in advance:

Specifically, in order to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are identical, and for each feature it is applied to the feature's distribution over the positive samples and its distribution over the negative samples, with a significance level of 0.05.

Table 1: observed value feature names and their interpretation;

Table 2. Top fifty features that pass the significance test, with their p-values;

Using stratified k-fold cross-validation (k = 5), the data were split into 5 folds, and each time the data were divided 8:2 into a training set and a test set for training and testing the model. The evaluation metrics are sensitivity, specificity and AUC, where the threshold is the minimum threshold at which sensitivity exceeds 0.9, and the specificity is the specificity at that threshold. The models used are CatBoost, XGBoost and LightGBM, and the other hyperparameters were obtained by hyperparameter search with cross-validation. Three tasks were set: Task_1, Task_2 and Task_3; under each task, the following models were set:

(1) M_all, trained with all features;

(2) M_sig, using all features that pass the significance test;

(3) M_sigSTE, using the electronic stethoscope features that pass the significance test;

(4) M_sigLSI, using the small lung instrument features that pass the significance test;

(5) M_sig50, using the 50 features with the lowest p-values among those that pass the significance test under the task setting;

(6) M_sig25, using the first 25 features that pass the significance test;

(7) M_orig, trained with all raw observations.
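The feature subsets behind these models can be sketched as below. The sketch assumes that each feature carries a p-value from the significance test and a tag naming its source device ("STE" for the electronic stethoscope, "LSI" for the small lung instrument); the function name, the tags, and the assumption that M_sig25 is ranked by lowest p-value in the same way as M_sig50 are hypothetical. M_orig is omitted because it uses the raw observations rather than a selection from the expanded features.

```python
def build_feature_sets(p_values, source, alpha=0.05):
    """Map each model name to the list of feature names it would use.

    p_values: dict feature name -> p-value of the significance test
    source:   dict feature name -> "STE" or "LSI" (hypothetical device tags)
    """
    significant = sorted(
        (f for f, p in p_values.items() if p < alpha),
        key=p_values.get,                 # lowest p-value first
    )
    return {
        "M_all": list(p_values),
        "M_sig": significant,
        "M_sigSTE": [f for f in significant if source[f] == "STE"],
        "M_sigLSI": [f for f in significant if source[f] == "LSI"],
        "M_sig50": significant[:50],
        "M_sig25": significant[:25],
    }


# Made-up p-values for four hypothetical features.
p_values = {"a": 0.001, "b": 0.2, "c": 0.03, "d": 0.0005}
source = {"a": "STE", "b": "LSI", "c": "LSI", "d": "STE"}
feature_sets = build_feature_sets(p_values, source)
print(feature_sets)
```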

Model	Task_1	Task_2	Task_3
M_all	0.8135	0.8135	0.8135
M_sig	0.9268	0.9045	0.8887
M_sigSTE	0.9020	0.8845	0.8302
M_sigLSI	0.8279	0.7158	0.6617
M_sig50	0.8826	0.8000	0.8631
M_sig25	0.8173	0.8075	0.8816
M_orig	0.7361	0.7434	0.5782

Table 3. AUC mean score for cross validation;

The Task_1 setting had 123 significant features: 31 small lung instrument features and 92 electronic stethoscope features passed the significance test.

The Task_2 setting had 134 significant features: 33 small lung instrument features and 101 electronic stethoscope features passed the significance test. The Task_3 setting had 131 significant features: 28 small lung instrument features and 103 electronic stethoscope features passed the significance test.

Table 3 reports the mean cross-validation AUC, where the model used was LightGBM. (1) Task_1 obtains a higher score, which is consistent with intuition (predicting one day ahead is easier than predicting two or three days ahead);

(2) Using only the features generated by the small lung instrument lowers the score markedly, whereas using only the features generated by the stethoscope still performs well, so the observation data of the electronic stethoscope have stronger discriminative and predictive power;

(3) Screening the features with the significance test gives a marked improvement over directly using the raw observations or all features;

(4) With only the top 50 or top 25 most significant features, the model score drops somewhat, indicating reduced fitting ability once the number of features is reduced. The ROC curves of the five LightGBM-based models under the Task_1 setting are shown in FIG. 2;
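For reference, the AUC reported in these comparisons equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. A minimal pairwise sketch, using made-up labels and scores, is:

```python
from statistics import mean


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for two folds.
fold_aucs = [
    roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]),
    roc_auc([1, 0, 0, 1], [0.8, 0.3, 0.2, 0.6]),
]
print(fold_aucs, mean(fold_aucs))   # per-fold AUC and the cross-validation mean
```

The pairwise form is quadratic in the number of samples and is meant only to show the definition; a rank-based computation gives the same value more efficiently.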

To verify performance under other models, the results under the XGBoost and CatBoost models are given below:

Model	Task_1	Task_2	Task_3
M_all	0.8772	0.8673	0.8142
M_sig	0.9181	0.8946	0.8233
M_sigSTE	0.9036	0.8792	0.8110
M_sigLSI	0.8279	0.7610	0.7000
M_sig50	0.8372	0.7831	0.8184
M_sig25	0.8177	0.8047	0.8203
M_orig	0.7812	0.7881	0.6659

Table 3-1. Mean cross-validation AUC. The model used is XGBoost.

Table 3-2. Mean cross-validation AUC. The model used is CatBoost.

Setting	Sensitivity	Specificity	Probability threshold
Task_1 M_sig50	0.9043	0.7345	0.0113
Task_2 M_sig50	0.9043	0.7098	0.0091
Task_3 M_sig50	0.9043	0.6623	0.0042

Table 4. Sensitivity and specificity values of the optimal model M sig for each task setting.
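The choice of probability threshold can be sketched as follows. The description calls it the minimum threshold at which sensitivity exceeds 0.9; this sketch reads that as the cut-off at which sensitivity only just exceeds 0.9, that is, the largest candidate threshold that still keeps sensitivity above the target, which is an interpretation and not something the text states. The labels and scores are made up.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Sensitivity and specificity when score >= threshold means positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(y_true, scores, target=0.9):
    """Largest observed score whose sensitivity still exceeds `target`."""
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(y_true, scores, threshold)
        if sens > target:
            return threshold, sens, spec
    raise ValueError("no threshold reaches the target sensitivity")


# Made-up data: 10 positive samples followed by 5 negative samples.
y_true = [1] * 10 + [0] * 5
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.35, 0.3, 0.25, 0.05,
          0.45, 0.2, 0.1, 0.04, 0.02]
print(pick_threshold(y_true, scores))
```

Because exacerbations are rare, a high-sensitivity operating point of this kind tends to sit at a low predicted probability, which is in line with the small thresholds listed in Table 4.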

To examine the effect of different decision tree models on the prediction tasks, the following table reports model performance with the significant features under the three task settings. We compared XGBoost, LightGBM and CatBoost, the three most powerful gradient boosting algorithms based on decision tree models; according to the experimental results, LightGBM performs best under all three task settings.

Setting	LightGBM	CatBoost	XGBoost
Task_1 M_sig	0.9268	0.8852	0.9181
Task_2 M_sig	0.9045	0.8505	0.8946
Task_3 M_sig	0.8887	0.8722	0.8233

Table 5. Mean cross-validation AUC of the three decision tree ensemble models (CatBoost, LightGBM and XGBoost), based on the best feature combination, M_sig, under each task setting.
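The stratified 5-fold split behind these cross-validation averages can be sketched as below. The round-robin assignment without shuffling is an assumption chosen to keep the example deterministic, and the label counts are made up; each fold serves once as the test set (one fifth of the data) while the other four folds form the training set, which gives the 8:2 division described above.

```python
def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, keeping class proportions similar."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)    # deal indices out in turn
    return folds


labels = [1] * 10 + [0] * 40          # made-up: 10 positive, 40 negative samples
folds = stratified_folds(labels)

for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    positives_in_test = sum(labels[i] for i in test_fold)
    print(len(train), len(test_fold), positives_in_test)
```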

Example 2

Five-fold cross-validation was performed on Task_1 using the LightGBM model, with the AUC on each fold as follows:

Five-fold cross-validation was performed on Task_2 using the LightGBM model, with the AUC on each fold as follows:

Five-fold cross-validation was performed on Task_3 using the LightGBM model, with the AUC on each fold as follows:

Example 3

Five-fold cross-validation was performed on Task_1 using the LightGBM model, with the following average scores for ACC, precision, recall, F1 and AUC:

Table 5. Average scores for various indicators cross-validated on Task_1.

Five-fold cross-validation was performed on Task_2 using the LightGBM model, with the following average scores for ACC, precision, recall, F1 and AUC:

Model	AUC	ACC	Precision	Recall	F1
M_all	0.8135	0.8694	0.5415	0.7142	0.5475
M_sig	0.9045	0.8665	0.5365	0.9142	0.6441
M_sigSTE	0.8845	0.9108	0.6758	0.5714	0.6112
M_sigLSI	0.7158	0.7509	0.2969	0.7428	0.4028
M_sig50	0.8000	0.8807	0.4988	0.6243	0.5275
M_sig25	0.8075	0.8950	0.4902	0.6571	0.5322
M_orig	0.7434	0.8423	0.7107	0.6	0.5572

Table 6. Average scores for various indicators cross-validated on Task_2.

Five-fold cross-validation was performed on Task_3 using the LightGBM model, with the following average scores for ACC, precision, recall, F1 and AUC:

Table 7. Average scores for various indicators cross-validated on Task_3.
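The indicators reported in this example follow the usual confusion-matrix definitions. A short sketch on made-up labels and hard predictions is given below; treating an undefined ratio as 0.0 is a choice made for this sketch.

```python
def classification_metrics(y_true, y_pred):
    """ACC, precision, recall and F1 from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "ACC": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]     # made-up labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]     # made-up predictions
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With few positive samples, accuracy can stay high while precision is low, which is why recall and F1 are reported alongside ACC.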

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall be covered by the scope of protection of the present invention.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions; moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Claims

1. A method for predicting acute exacerbation of COPD based on a time window, the method comprising the steps of:

S1, collecting lung indices of a patient twice daily, in the morning and afternoon, using a small lung instrument, wherein the lung indices comprise FVC, FEV1, PEF and the maximum lung vibration energy value collected by a stethoscope, wherein FVC is measured with the small lung instrument and is the forced vital capacity, namely the maximum volume of air that can be exhaled as quickly as possible after a maximal inhalation; FEV1 is measured with the small lung instrument and is the volume of air exhaled in the first second of a maximal exhalation following a maximal deep inhalation; PEF is measured with the small lung instrument and is the instantaneous flow rate at the point of fastest expiratory flow during the forced vital capacity measurement;

S2, so that the model supports prediction for days T+1, T+2 and T+3 while remaining usable, predicting from the patient's lung monitoring indices over a fixed 7-day time window, collecting 32 indices of the patient every morning and evening by electronic equipment, and distinguishing the indices within the 7-day time window by date and by whether they were taken in the morning, giving 32 × 7 × 2 = 448 features;

S3, extracting further features from these features to reflect changes in the patient's lung monitoring indices; the data expansion includes sliding-window statistics of the indices and day-to-day differences of the indices;

S4, taking an exacerbation together with the preceding 7 days as a positive sample; for negative samples, the time window must not fall within the 30 days before or after an acute exacerbation, so that the exacerbation does not affect the monitoring indices; negative samples are generated by sampling from all stretches of data in which 7 consecutive days of observations are available;

S5, performing a significance test on the features, which finds 235 features that are significantly correlated with whether an exacerbation occurs on day T+d, wherein d = 1, 2, 3;

S6, using the 235 significant features as model inputs to predict whether an exacerbation occurs on day T+d; the model is a decision-tree-based ensemble model (XGBoost, LightGBM and CatBoost), and model performance is evaluated using 5-fold cross-validation;

the prediction period takes eight days as a time window for extracting positive samples, the eight days being denoted T-7, T-6, T-5, T-4, T-3, T-2, T-1 and T; for positive samples, day T is the onset date of the acute exacerbation; for negative samples, the time window must not fall within the 7 days before or after an acute exacerbation;

in order to give early warning, 3 groups of prediction tasks are set within the prediction period:

(3) Task_3, using the observations from day T-7 to day T-3 to predict whether an acute exacerbation occurs on day T;

to reduce the number of features, a Kolmogorov-Smirnov test is performed on the features; the test compares whether two distributions are identical, and for each feature it is applied to the feature's distribution over the positive samples and its distribution over the negative samples, with a significance level of 0.05.

2. A method for predicting acute exacerbations of COPD based on a time window according to claim 1, wherein: the interpretation method of the XGBoost model comprises the following steps:

3. A method for predicting acute exacerbations of COPD based on a time window according to claim 1, wherein: XGBoost uses a Boosting ensemble method, is widely used in data mining, handles missing values and applies regularization, and uses second-order information to accelerate optimization of the cost function.

4. A method for predicting acute exacerbations of COPD based on a time window according to claim 1, wherein: LightGBM is a new gradient boosting tree framework that supports the GBDT, GBRT, GBM and MART algorithms, and is a complete solution for distributed training based on the DMTK framework.

5. A method for predicting acute exacerbations of COPD based on a time window according to claim 1, wherein: the Catboost algorithm includes: in the sensing period, the secondary user sends the energy value in the sensed channel to the fusion center as a characteristic energy vector, and the primary user intermittently sends information of occupying the frequency spectrum resource or not to the fusion center as a label, so that the construction of a training data set is completed, and a model is trained by a Catboost algorithm in the fusion center.