WO2019153595A1

WO2019153595A1 - Method for predicting risk of chronic obstructive pulmonary disease, server, and computer readable storage medium

Info

Publication number: WO2019153595A1
Application number: PCT/CN2018/089343
Authority: WO
Inventors: 阮晓雯; 徐亮; 肖京
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-02-07
Filing date: 2018-05-31
Publication date: 2019-08-15
Also published as: CN108257675A

Abstract

A method for predicting risk of chronic obstructive pulmonary disease, a server (2), and a computer readable storage medium. The method comprises: configuring a range of user information required to be obtained (S400, S500); obtaining relevant sample data according to the range of user information (S402, S502); establishing a plurality of models according to the sample data, performing training and testing, and then filtering to obtain an optimal model combination (S404); and establishing a combined classifier model according to the optimal model combination (S406, S508). The method for predicting risk of chronic obstructive pulmonary disease, the server (2), and the computer readable storage medium are capable of predicting the risk of an individual having chronic obstructive pulmonary disease within the coming year.

Description

Method for predicting risk of chronic obstructive pulmonary disease, server and computer readable storage medium

Priority claim

This application claims priority to Chinese Patent Application No. 201810125017.8, filed on February 7, 2018 and entitled "Method for predicting risk of chronic obstructive pulmonary disease, server, and computer readable storage medium", the entire content of which is incorporated herein by reference.

Technical field

The present application relates to the field of data analysis technologies, and in particular, to a method for predicting the risk of chronic obstructive pulmonary disease, a server, and a computer readable storage medium.

Background

Chronic obstructive pulmonary disease (COPD) is a chronic lung disease characterized by airflow limitation that is not fully reversible. The airflow limitation is usually progressive and is associated with an abnormal inflammatory response of the lungs to harmful particles or gases, mainly from smoking. Although COPD directly affects the lungs, it can also cause significant systemic effects. Chronic cough and sputum production often precede airflow limitation by many years, but not all patients with these symptoms develop COPD. A pulmonary function test is required for a definite diagnosis. COPD has a high mortality rate and is accompanied by shortness of breath, cough, wheezing, and repeated exacerbations. It damages not only the airways, alveoli, and pulmonary blood vessels but also extrapulmonary tissues and organs such as bones, skeletal muscles, and the heart, and it is a polygenic systemic disease. There are large individual differences in clinical manifestations, course of disease, and response to medication.

Academic risk assessment models for chronic obstructive pulmonary disease are based on expert scoring: important factors are selected, a score is set for each factor, and a comprehensive score is calculated. These scoring methods involve few influencing factors and have low accuracy. In addition, the data they require is difficult to acquire, which makes risk assessment for large populations difficult.

Summary of the invention

In view of this, the present application proposes a chronic obstructive pulmonary disease risk prediction method, a server, and a computer readable storage medium to solve the problem of how to conveniently and accurately predict the risk of chronic obstructive pulmonary disease.

First, in order to achieve the above object, the present application proposes a method for predicting the risk of chronic obstructive pulmonary disease, which comprises the steps of:

Setting a range of user information that needs to be acquired;

Acquiring relevant sample data according to the range of user information;

Establishing a plurality of models according to the sample data, performing training and testing, and screening out the optimal model combination;

Establishing a combined classifier model according to the optimal model combination; and

Performing chronic obstructive pulmonary disease risk prediction according to the combined classifier model and user personal information.

In addition, in order to achieve the above object, the present application further provides a server including a memory and a processor, where the memory stores a chronic obstructive pulmonary disease risk prediction system operable on the processor, and the chronic obstructive pulmonary disease risk prediction system, when executed by the processor, implements the steps of the chronic obstructive pulmonary disease risk prediction method as described above.

Further, in order to achieve the above object, the present application further provides a computer readable storage medium storing a chronic obstructive pulmonary disease risk prediction system, where the chronic obstructive pulmonary disease risk prediction system is executable by at least one processor to cause the at least one processor to perform the steps of the chronic obstructive pulmonary disease risk prediction method as described above.

Compared with the prior art, the method for predicting the risk of chronic obstructive pulmonary disease, the server, and the computer readable storage medium proposed by the present application can establish a chronic obstructive pulmonary disease prediction model that covers comprehensive user information such as health records, interests, consumption, and living habits. Principal component analysis and feature screening methods are used to screen the feature data and reduce its dimensionality so that the important features are extracted. A training set and a test set are then constructed according to 10-fold cross-validation and used to screen out the optimal model combination, and the results of the models in the combination are weighted to obtain the final combined classifier model. The model is established by the XGBoost algorithm and predicts an individual's risk of developing chronic obstructive pulmonary disease within the next year. The solution comprehensively considers the factors affecting the incidence of chronic obstructive pulmonary disease, has high prediction accuracy, is convenient to implement, and significantly improves the prediction effect.

Drawings

FIG. 1 is a schematic diagram of an optional hardware architecture of the server of the present application;

FIG. 2 is a schematic diagram of the program modules of a first embodiment of the chronic obstructive pulmonary disease risk prediction system of the present application;

FIG. 3 is a schematic diagram of the program modules of a second embodiment of the chronic obstructive pulmonary disease risk prediction system of the present application;

FIG. 4 is a schematic flow chart of a first embodiment of the method for predicting the risk of chronic obstructive pulmonary disease according to the present application;

FIG. 5 is a schematic flow chart of a second embodiment of the method for predicting the risk of chronic obstructive pulmonary disease according to the present application.

Reference numerals:

Server	2
Memory	11
Processor	12
Network interface	13
Chronic obstructive pulmonary disease risk prediction system	200
Setting module	201
Acquisition module	202
Modeling module	203
Combination module	204
Prediction module	205
Preprocessing module	206

The implementation, functional features and advantages of the present application will be further described with reference to the accompanying drawings.

Detailed description

In order to make the objects, technical solutions, and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

It should be noted that the descriptions "first", "second", and the like in the present application are for the purpose of description only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person skilled in the art can implement the combination. When a combination of technical solutions is contradictory or impossible to implement, the combination should be considered not to exist and is not within the scope of protection claimed by this application.

Referring to FIG. 1, it is a schematic diagram of an optional hardware architecture of the server 2 of the present application.

In this embodiment, the server 2 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to each other through a system bus. Note that FIG. 1 only shows the server 2 with the components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.

The server 2 may be a computing device such as a rack server, a blade server, a tower server, or a cabinet server. The server 2 may be an independent server or a server cluster composed of multiple servers.

The memory 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read only memory (ROM), an electrically erasable programmable read only memory (EEPROM), a programmable read only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the server 2, such as a hard disk or memory of the server 2. In other embodiments, the memory 11 may be an external storage device of the server 2, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the server 2. The memory 11 may also include both the internal storage unit of the server 2 and its external storage device. In this embodiment, the memory 11 is generally used to store the operating system installed on the server 2 and various types of application software, such as the program code of the chronic obstructive pulmonary disease risk prediction system 200. The memory 11 can also be used to temporarily store various types of data that have been output or are to be output.

The processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is typically used to control the overall operation of the server 2. In this embodiment, the processor 12 is configured to run program code or process data stored in the memory 11, such as running the chronic obstructive pulmonary disease risk prediction system 200 and the like.

The network interface 13 may comprise a wireless network interface or a wired network interface, which is typically used to establish a communication connection between the server 2 and other electronic devices.

The hardware structure and functions of the devices related to this application have now been described in detail. Various embodiments of the present application are presented below on the basis of the above description.

First, the present application proposes a chronic obstructive pulmonary disease risk prediction system 200.

Referring to FIG. 2, it is a program block diagram of the first embodiment of the chronic obstructive pulmonary disease risk prediction system 200 of the present application.

In this embodiment, the chronic obstructive pulmonary disease risk prediction system 200 includes a series of computer program instructions stored in the memory 11. When the computer program instructions are executed by the processor 12, the chronic obstructive pulmonary disease risk prediction operations of the embodiments of the present application can be implemented. In some embodiments, the chronic obstructive pulmonary disease risk prediction system 200 can be divided into one or more modules based on the particular operations implemented by the various portions of the computer program instructions. For example, in FIG. 2, the chronic obstructive pulmonary disease risk prediction system 200 is divided into a setting module 201, an acquisition module 202, a modeling module 203, a combination module 204, and a prediction module 205, where:

The setting module 201 is configured to set a range of user information that needs to be acquired.

Specifically, since the risk of chronic obstructive pulmonary disease cannot be predicted accurately from the user's health information alone, the range of user information needs to take more comprehensive influencing factors into account. In this embodiment, the range of user information includes the user's health records, hobbies, consumption habits, living habits, and the like. The range covers all aspects of the user's information and is not limited to health information, so that the risk of chronic obstructive pulmonary disease can be predicted more comprehensively and accurately.

The acquisition module 202 is configured to acquire relevant sample data according to the range of user information.

Specifically, for each user, data in multiple dimensions corresponding to the user, such as health records, hobbies, consumption habits, and living habits, is acquired from the corresponding data sources according to the set range of user information. For example, a user's health records are acquired from a hospital or insurance company database, and the user's consumption habits are acquired from a bank database. In this embodiment, the corresponding data of the users in a preset region (for example, an entire city) may be used as the sample data.
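As an illustration of assembling per-user sample data from several sources, the following minimal sketch merges records keyed by a user identifier. The function `merge_sources`, the source names (`health_records`, `spending_records`), the field names, and the user ids are hypothetical; the application does not specify any data layout.

```python
from collections import defaultdict

def merge_sources(*sources):
    """Merge per-user records from several data sources into one row per user.

    Each source is a dict mapping a user id to a dict of fields.
    Fields that a source lacks for a user are simply absent from that user's row.
    """
    merged = defaultdict(dict)
    for source in sources:
        for user_id, fields in source.items():
            merged[user_id].update(fields)
    return dict(merged)

# Hypothetical example data standing in for hospital/insurer and bank databases.
health_records = {"u1": {"smoker": 1, "age": 63}, "u2": {"smoker": 0, "age": 41}}
spending_records = {"u1": {"monthly_tobacco_spend": 120.0}}

samples = merge_sources(health_records, spending_records)
```

Fields that are absent after the merge correspond to the missing values that the second embodiment handles in its preprocessing step.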

The modeling module 203 is configured to establish a plurality of models according to the sample data, perform training and testing, and screen out the optimal model combination.

Specifically, the acquired sample data is modeled by the XGBoost algorithm, with a logistic regression function selected as the objective function of the algorithm. The XGBoost algorithm can combine n different models, and the optimal model combination, that is, the optimal value of n, is screened out through training and testing.
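For reference, a logistic regression objective for a sample with label $y_i \in \{0, 1\}$ and raw model score $\hat{y}_i$ is commonly written as below, together with the first and second derivatives that gradient boosting uses when fitting each new model. This is the standard binary logistic loss, given here as an assumption about what the text means; the symbols $p_i$, $g_i$, and $h_i$ do not appear in the application.

```latex
\begin{gather*}
p_i = \frac{1}{1 + e^{-\hat{y}_i}}, \qquad
\ell(y_i, \hat{y}_i) = -\bigl[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\bigr] \\
g_i = \frac{\partial \ell}{\partial \hat{y}_i} = p_i - y_i, \qquad
h_i = \frac{\partial^2 \ell}{\partial \hat{y}_i^2} = p_i \,(1 - p_i)
\end{gather*}
```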

In this embodiment, a training set and a test set are constructed according to the 10-fold cross-validation method in order to screen out the optimal model combination. In 10-fold cross-validation, the data set is divided into 10 parts; in turn, 9 parts are used as training set data and the remaining part as test set data. Each round yields an accuracy (or error rate), and the mean of the 10 accuracies (or error rates) is used as an estimate of the accuracy of the algorithm. The 10-fold cross-validation may also be repeated several times (for example, 10 times), with the mean over all repetitions taken as the estimate.
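The 10-fold procedure can be sketched in a few lines. The helper names `k_fold_indices`, `cross_validate`, and `majority_class_trainer`, the fixed random seed, and the toy data are illustrative assumptions; the trivial majority-class model merely stands in for a real model so that the sketch stays self-contained.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, k=10, seed=0):
    """Return the mean accuracy over k train/test rounds.

    train_fn(train_x, train_y) must return a function mapping one sample to a label.
    """
    folds = k_fold_indices(len(samples), k, seed)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        predict = train_fn([samples[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def majority_class_trainer(train_x, train_y):
    """Toy stand-in for a real model: always predicts the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda sample: majority

# Hypothetical toy data: 100 samples, 30 of them positive.
xs = list(range(100))
ys = [1 if i < 30 else 0 for i in range(100)]
mean_accuracy = cross_validate(xs, ys, majority_class_trainer)
```

Every sample is used for testing exactly once, so the mean accuracy reflects performance on data that the corresponding model never saw during training.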

In this embodiment, the sample data is divided into 10 parts. Nine parts are used as training set data to analyze which data dimensions affect the risk of chronic obstructive pulmonary disease and the degree of influence (for example, a score) of each data dimension on that risk, thereby establishing a model. The remaining part is then used as test set data to verify the accuracy of the analysis (that is, of the model). By taking each group of 9 parts in turn as training set data and the remaining part as test set data, 10 models are obtained. The optimal model combination is then screened out according to the XGBoost algorithm.

The combining module 204 is configured to establish a combined classifier model according to the optimal model combination.

Specifically, the prediction results of the models in the optimal model combination are weighted to obtain the final combined classifier model, which is then used to predict the risk of chronic obstructive pulmonary disease for users whose disease status is unknown.

A combined classifier is an algorithm that integrates multiple models, such as the XGBoost algorithm. Once the optimal model combination is obtained, weighting the prediction results of its n models yields the final combined classifier model. The output of the combined classifier model is the final prediction result obtained by weighting the results of the n models.
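A minimal sketch of the weighting step is shown below, assuming that each of the n models outputs a risk probability and that the combination is a normalized weighted average. The function name `combine_predictions` and the example numbers are hypothetical, and the application does not state how the weights are determined.

```python
def combine_predictions(probabilities, weights):
    """Weighted average of per-model risk probabilities for one user."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total

# Hypothetical outputs of n = 3 models for the same user, with illustrative weights.
model_outputs = [0.2, 0.6, 0.4]
model_weights = [0.5, 0.3, 0.2]
risk = combine_predictions(model_outputs, model_weights)
```

Dividing by the sum of the weights keeps the combined value in the same range as the individual predictions even when the weights do not sum to one.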

The prediction module 205 is configured to perform a chronic obstructive pulmonary disease risk prediction according to the combined classifier model and user personal information.

Specifically, when predicting the risk of chronic obstructive pulmonary disease for a given user, the user information data corresponding to that user is acquired according to the input parameters of the combined classifier model (that is, which data dimensions are needed, such as the user's health records, hobbies, consumption habits, and living habits). The acquired data is input into the combined classifier model, each model in it makes a prediction, and the resulting predictions are combined by weighted calculation according to the weight of each model. The final prediction result is the user's risk of developing chronic obstructive pulmonary disease within the next year.

Referring to FIG. 3, it is a program block diagram of a second embodiment of the chronic obstructive pulmonary disease risk prediction system 200 of the present application. In this embodiment, in addition to the setting module 201, the acquisition module 202, the modeling module 203, the combination module 204, and the prediction module 205 of the first embodiment, the chronic obstructive pulmonary disease risk prediction system 200 further includes a preprocessing module 206.

The pre-processing module 206 is configured to perform missing value and outlier processing on the sample data, and perform dimensionality reduction.

Specifically, missing values and outliers in each dimension of the user data are processed first: data whose saturation is too low is deleted, outliers are treated as missing values, and missing values are filled by the feature filling method. Continuous values are then discretized, and principal component analysis (PCA) and feature screening methods are used for dimensionality reduction.
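The cleaning steps for a single numeric field might look like the sketch below. Several choices here are assumptions rather than details from the application: "saturation" is taken to mean the fraction of non-missing values, the 0.5 threshold is arbitrary, outliers are detected with a fixed valid range, and missing values are filled with the median of the remaining values. The function name `clean_column` is hypothetical.

```python
from statistics import median

def clean_column(values, valid_range, min_saturation=0.5):
    """Clean one numeric field; return None if the field should be dropped.

    values: list of numbers, with None marking a missing entry.
    valid_range: (low, high); values outside it are treated as outliers.
    """
    # Drop the field when too few entries are filled in (assumed meaning of saturation).
    filled = sum(v is not None for v in values)
    if filled / len(values) < min_saturation:
        return None
    # Treat outliers as missing values.
    low, high = valid_range
    cleaned = [v if v is not None and low <= v <= high else None for v in values]
    present = [v for v in cleaned if v is not None]
    if not present:
        return None
    # Fill missing values with the median of the usable values (an assumed fill rule).
    fill = median(present)
    return [fill if v is None else v for v in cleaned]

ages = clean_column([63, None, 41, 250, 58], valid_range=(0, 120))
sparse = clean_column([None, None, None, 70, None], valid_range=(0, 120))
```

In this example the implausible age 250 is first turned into a missing value and then filled together with the original gap, while the second field is dropped because only one of its five entries is filled in.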

Discretizing continuous values means dividing them into equal-width or equal-frequency bins. For example, age is a continuous value and can be divided into 10-year age groups: 0-10, 11-20, ..., 91-100. With one field per age group, the continuous age field is finally converted into 10 classification fields.
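The age example can be sketched as follows, assuming integer ages between 0 and 100 and a one-hot encoding of the 10 groups. The function names `age_group` and `one_hot_age` are illustrative.

```python
def age_group(age):
    """Map an integer age in 0..100 to a group index: 0 for 0-10, 1 for 11-20, ..., 9 for 91-100."""
    if not 0 <= age <= 100:
        raise ValueError("age outside the 0-100 range")
    return 0 if age <= 10 else (age - 1) // 10

def one_hot_age(age, n_groups=10):
    """Convert one age into n_groups binary classification fields."""
    fields = [0] * n_groups
    fields[age_group(age)] = 1
    return fields

encoded = one_hot_age(35)  # 35 falls in the 31-40 group, index 3
```

Note that the first group spans 11 values (0 to 10) while the others span 10, which follows the grouping given in the text.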

The main purpose of principal component analysis is to reduce the dimensionality of the data set and then select the most important features or combinations of features. Its main steps are: standardizing the raw data; calculating the correlation coefficient matrix of the standardized variables; calculating the eigenvalues and eigenvectors of the correlation coefficient matrix; calculating the principal component values; and analyzing the statistical results to extract the required principal components. After dimensionality reduction by principal component analysis, the important data dimensions can be extracted from the sample data.
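The steps listed above can be illustrated with a small pure-Python sketch that extracts only the first principal component. Power iteration is used here as a simple stand-in for a full eigendecomposition, which is a choice made for brevity and not something the application specifies. The function names and the toy data are hypothetical, and the sketch assumes that no column is constant.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit (population) standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def correlation_matrix(z):
    """Correlation coefficient matrix of already standardized data."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(d)] for i in range(d)]

def dominant_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and its unit eigenvector, found by power iteration."""
    d = len(matrix)
    vec = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d))
    return value, vec

# Hypothetical data: columns 0 and 1 are perfectly correlated, column 2 is only weakly related.
data = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2], [5, 10, 3]]
z = standardize(data)
eigenvalue, component = dominant_eigenpair(correlation_matrix(z))
# Principal component values: projection of each standardized sample onto the component.
scores = [sum(r[j] * component[j] for j in range(3)) for r in z]
```

The ratio of the eigenvalue to the number of columns gives the share of total variance explained by the component, which is one way to decide how many components to keep.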

In addition, the present application also proposes a method for predicting the risk of chronic obstructive pulmonary disease.

Referring to FIG. 4, it is a schematic flowchart of the first embodiment of the method for predicting the risk of developing chronic obstructive pulmonary disease. In this embodiment, the order of execution of the steps in the flowchart shown in FIG. 4 may be changed according to different requirements, and some steps may be omitted.

Step S400, setting a range of user information that needs to be acquired.

Step S402, acquiring relevant sample data according to the range of user information.

Step S404, establishing a plurality of models according to the sample data, performing training and testing, and screening the optimal model combination.

In this embodiment, the sample data is divided into 10 parts. Nine parts are used as training set data to analyze which data dimensions affect the risk of chronic obstructive pulmonary disease and the degree of influence (for example, a score) of each data dimension on that risk, thereby establishing a model. The remaining part is then used as test set data to verify the accuracy of the analysis (that is, of the model). By taking each group of 9 parts in turn as training set data and the remaining part as test set data, 10 models are obtained. The optimal model combination is then screened out according to the XGBoost algorithm.

Step S406, establishing a combined classifier model according to the optimal model combination.

Step S408, predicting the risk of chronic obstructive pulmonary disease according to the combined classifier model and user personal information.

The method for predicting the risk of chronic obstructive pulmonary disease proposed in this embodiment can establish a chronic obstructive pulmonary disease prediction model covering comprehensive user information such as health records, interests, consumption, and living habits. A training set and a test set are constructed according to 10-fold cross-validation and used to screen out the optimal model combination, and the results of the models in the combination are weighted to obtain the final combined classifier model. The model is established by the XGBoost algorithm and predicts an individual's risk of developing chronic obstructive pulmonary disease within the next year. The solution comprehensively considers the factors affecting the incidence of chronic obstructive pulmonary disease, has high prediction accuracy, is easy to implement, and significantly improves the prediction effect.

Referring to FIG. 5, it is a schematic flowchart of a second embodiment of the method for predicting the risk of developing chronic obstructive pulmonary disease. In this embodiment, the steps S500-S502 and S506-S510 of the chronic obstructive pulmonary disease risk prediction method are similar to the steps S400-S408 of the first embodiment, except that the method further includes step S504.

Step S500, setting a range of user information that needs to be acquired.

Specifically, since the risk of chronic obstructive pulmonary disease cannot be predicted accurately from the user's health information alone, the range of user information needs to take more comprehensive influencing factors into account. In this embodiment, the range of user information includes the user's health records, hobbies, consumption habits, living habits, and the like. The range covers all aspects of the user's information and is not limited to health information, so that the risk of chronic obstructive pulmonary disease can be predicted more comprehensively and accurately.

Step S502, acquiring relevant sample data according to the range of user information.

Step S504, performing missing value and outlier processing on the sample data, and performing dimensionality reduction.

Step S506, establishing a plurality of models according to the data obtained after dimensionality reduction, performing training and testing, and screening out the optimal model combination.

Specifically, the data obtained after dimensionality reduction is modeled by the XGBoost algorithm, with a logistic regression function selected as the objective function of the algorithm. The XGBoost algorithm can combine n different models, and the optimal model combination, that is, the optimal value of n, is screened out through training and testing.

Step S508, establishing a combined classifier model according to the optimal model combination.

A combined classifier is an algorithm that integrates multiple models, such as the XGBoost algorithm. Once the optimal model combination is obtained, weighting the prediction results of its n models yields the final combined classifier model. The output of the combined classifier model is the final prediction result obtained by weighting the results of the n models.

Step S510, predicting the risk of chronic obstructive pulmonary disease according to the combined classifier model and user personal information.

The method for predicting the risk of chronic obstructive pulmonary disease proposed in this embodiment can establish a chronic obstructive pulmonary disease prediction model covering comprehensive user information such as health records, interests, consumption, and living habits. Principal component analysis and feature screening methods are used to screen the feature data and reduce its dimensionality so that the important features are extracted. A training set and a test set are then constructed according to 10-fold cross-validation and used to screen out the optimal model combination, and the results of the models in the combination are weighted to obtain the final combined classifier model. The model is established by the XGBoost algorithm and predicts an individual's risk of developing chronic obstructive pulmonary disease within the next year. The solution comprehensively considers the factors affecting the incidence of chronic obstructive pulmonary disease, has high prediction accuracy, is convenient to implement, and significantly improves the prediction effect.

The serial numbers of the embodiments of the present application are for description only and do not indicate the relative merits of the embodiments.

Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the foregoing embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes over the prior art may be embodied in the form of a software product. The software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to perform the methods described in the embodiments of the present application.

The above are only preferred embodiments of the present application and do not limit the scope of its patent protection. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included in the scope of patent protection of the present application.

Claims

1. A method for predicting the risk of chronic obstructive pulmonary disease, applied to a server, characterized in that the method comprises the steps of:

setting a range of user information that needs to be acquired;

acquiring relevant sample data according to the range of user information;

establishing a plurality of models according to the sample data, performing training and testing, and screening out the optimal model combination;

establishing a combined classifier model according to the optimal model combination; and

performing chronic obstructive pulmonary disease risk prediction according to the combined classifier model and user personal information.
2. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 1, wherein, before the step of establishing a plurality of models according to the sample data, the method further comprises the step of:

performing missing value and outlier processing on the sample data, and performing dimensionality reduction.
3. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 1 or 2, wherein the range of user information includes the user's health records, hobbies, consumption habits, and living habits.
4. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 2, wherein the step of performing missing value and outlier processing on the sample data comprises:

deleting data whose saturation is too low, treating outliers as missing values, filling missing values by the feature filling method, and discretizing continuous values.
5. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 2, wherein the dimensionality reduction is performed by principal component analysis and a feature screening method.
6. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 1 or 2, wherein the models are established by the XGBoost algorithm.
7. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 1 or 2, wherein a training set and a test set are constructed according to the 10-fold cross-validation method to screen out the optimal model combination.
8. The method for predicting the risk of chronic obstructive pulmonary disease according to claim 1 or 2, wherein the step of establishing a combined classifier model according to the optimal model combination comprises:

after the optimal model combination is obtained, weighting the prediction results of its n models to obtain the combined classifier model, the output of the combined classifier model being the final prediction result obtained by weighting the prediction results of the n models.
9. A server, comprising a memory and a processor, wherein the memory stores a chronic obstructive pulmonary disease risk prediction system operable on the processor, and the chronic obstructive pulmonary disease risk prediction system, when executed by the processor, implements the following steps:

setting a range of user information that needs to be acquired;

acquiring relevant sample data according to the range of user information;

establishing a plurality of models according to the sample data, performing training and testing, and screening out the optimal model combination;

establishing a combined classifier model according to the optimal model combination; and

performing chronic obstructive pulmonary disease risk prediction according to the combined classifier model and user personal information.
10. The server according to claim 9, wherein, before the step of establishing a plurality of models according to the sample data, the following step is further implemented:

performing missing value and outlier processing on the sample data, and performing dimensionality reduction.
11. The server according to claim 9 or 10, wherein the range of user information includes the user's health records, hobbies, consumption habits, and living habits.
12. The server according to claim 10, wherein the step of performing missing value and outlier processing on the sample data comprises:

deleting data whose saturation is too low, treating outliers as missing values, filling missing values by the feature filling method, and discretizing continuous values.
13. The server according to claim 10, wherein the dimensionality reduction is performed by principal component analysis and a feature screening method.
14. The server according to claim 9 or 10, wherein the models are established by the XGBoost algorithm.
15. The server according to claim 9 or 10, wherein a training set and a test set are constructed according to the 10-fold cross-validation method to screen out the optimal model combination.
16. The server according to claim 9 or 10, wherein the step of establishing a combined classifier model according to the optimal model combination comprises:

after the optimal model combination is obtained, weighting the prediction results of its n models to obtain the combined classifier model, the output of the combined classifier model being the final prediction result obtained by weighting the prediction results of the n models.

17. A computer readable storage medium storing a chronic obstructive pulmonary disease risk prediction system, the chronic obstructive pulmonary disease risk prediction system being executable by at least one processor to cause the at least one processor to perform the following steps:

setting a range of user information that needs to be acquired;

acquiring relevant sample data according to the range of user information;

establishing a plurality of models according to the sample data, performing training and testing, and screening out the optimal model combination;

establishing a combined classifier model according to the optimal model combination; and

performing chronic obstructive pulmonary disease risk prediction according to the combined classifier model and user personal information.
18. The computer readable storage medium according to claim 17, wherein, before the step of establishing a plurality of models according to the sample data, the following step is further performed:

performing missing value and outlier processing on the sample data, and performing dimensionality reduction.
19. The computer readable storage medium according to claim 18, wherein the step of performing missing value and outlier processing on the sample data comprises:

deleting data whose saturation is too low, treating outliers as missing values, filling missing values by the feature filling method, and discretizing continuous values.
20. The computer readable storage medium according to claim 17 or 18, wherein the step of establishing a combined classifier model according to the optimal model combination comprises:

after the optimal model combination is obtained, weighting the prediction results of its n models to obtain the combined classifier model, the output of the combined classifier model being the final prediction result obtained by weighting the prediction results of the n models.