CN114566284A

CN114566284A - Disease prognosis risk prediction model training method and device and electronic equipment

Info

Publication number: CN114566284A
Application number: CN202210255041.XA
Authority: CN
Inventors: 杜鑫惠; 王绍博
Original assignee: Yidu Cloud Beijing Technology Co Ltd
Current assignee: Yidu Cloud Beijing Technology Co Ltd
Priority date: 2021-06-07
Filing date: 2022-03-15
Publication date: 2022-05-31
Also published as: CN113345584A

Abstract

The application provides a disease prognosis risk prediction model training method and device, an electronic device, and a computer-readable storage medium. The method comprises the following steps: acquiring medical data of a plurality of samples, wherein the medical data of each sample comprises a plurality of features of different dimensions and a risk marker; training a tree-based model at the granularity of the features of each sample; determining a plurality of risk rules according to paths of the tree-based model, each risk rule comprising features of at least two different dimensions; performing rule pruning on the plurality of risk rules; and performing model training according to the medical data of each sample and the pruned risk rules to obtain a disease prognosis risk prediction model, so that a prediction can be made from the medical data of a patient through the disease prognosis risk prediction model. The method and the device can improve the accuracy of disease prognosis risk prediction.

Description

Disease prognosis risk prediction model training method and device and electronic equipment

Technical Field

The present application relates to artificial intelligence and big data technologies, and in particular, to a disease prognosis risk prediction model training method, apparatus, electronic device, and computer-readable storage medium.

Background

Disease prognosis refers to the assessment of how a disease will develop. Beyond an initial understanding of its clinical manifestations, laboratory and imaging findings, etiology, pathology and course, it is important to evaluate the short-term and long-term efficacy and the outcome of the disease (recovery or degree of progression) based on the timing and method of treatment, in combination with new conditions found during treatment. Disease prognosis is related to many factors, such as the timing of treatment, the severity of the disease, the level of medical care, comorbidities, the individual ability of the doctor, the constitution and age of the patient, whether the patient faces the disease squarely and understands it, and whether treatment is continued. Determining disease prognosis risk is an important task in the medical field. However, the prior art usually estimates disease prognosis risk from independent factors and does not consider the nonlinear correlations between the factors.

In the case of heart disease, heart failure is a severe or advanced manifestation of heart disease. Heart failure is a clinical syndrome characterized by blood congestion in the pulmonary or systemic circulation, and inadequate blood perfusion of organs and tissues. Because heart failure is characterized by high mortality and readmission rates, risk assessment for the prognosis of heart failure is of particular importance.

In one solution provided in the related art, it is proposed to predict the mortality of heart failure patients using the Seattle Heart Failure Model, to use the Enhanced Feedback for Effective Cardiac Treatment (EFFECT) model, and to use classification and regression tree models to predict in-hospital mortality and perform risk stratification for acute decompensated heart failure.

In another solution provided by the related art, a medical knowledge mining pipeline based on temporal pattern mining is proposed for early detection of congestive heart failure, or medical knowledge is mined through a Refined-Clinical Knowledge Model (R-CKM).

However, none of the above solutions provided by the related art models the nonlinear relationships between features, and they mine medical knowledge only along the diagnostic dimension, so the accuracy of the mined medical knowledge is not high.

Disclosure of Invention

The embodiment of the application provides a disease prognosis risk prediction model training method and device, electronic equipment and a computer readable storage medium, which can be used for training a disease prognosis risk prediction model based on a nonlinear relation among characteristics and improving the accuracy of mining medical knowledge.

The technical scheme of the embodiment of the application is realized as follows:

in a first aspect, the present application provides a method for training a disease prognosis risk prediction model, including:

acquiring medical data of a plurality of samples, wherein the medical data of each sample comprises a plurality of characteristics with different dimensions and risk marks;

training a tree-based model by taking the characteristics of each sample as granularity;

determining a plurality of risk rules according to a path of the tree-based model, each risk rule comprising features of at least two different dimensions;

performing rule pruning on the plurality of risk rules;

and performing model training according to the medical data of each sample and the plurality of risk rules after pruning to obtain a disease prognosis risk prediction model, so as to predict the medical data of the patient through the disease prognosis risk prediction model.
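
The rule-determination step can be illustrated with a minimal sketch. It assumes the trained tree is available as a nested dictionary; this representation, the feature names, the `DIMENSION` mapping and the helper names are hypothetical and are not taken from the application. Each root-to-leaf path is read off as a conjunction of conditions, and only paths that combine features of at least two different dimensions are retained.

```python
# Hypothetical mapping from feature name to its dimension (assumed for illustration).
DIMENSION = {"age": "basic_info", "bnp": "laboratory", "sbp": "vital_signs"}

# Hypothetical trained tree: internal nodes split on "feature <= threshold".
TREE = {
    "feature": "bnp", "threshold": 400.0,
    "left": {"feature": "bnp", "threshold": 100.0,
             "left": {"leaf": 0.05}, "right": {"leaf": 0.10}},
    "right": {"feature": "age", "threshold": 70.0,
              "left": {"leaf": 0.20},
              "right": {"feature": "sbp", "threshold": 100.0,
                        "left": {"leaf": 0.60}, "right": {"leaf": 0.30}}},
}


def extract_rules(node, path=()):
    """Yield (conditions, leaf_value) for every root-to-leaf path."""
    if "leaf" in node:
        yield list(path), node["leaf"]
        return
    feature, threshold = node["feature"], node["threshold"]
    yield from extract_rules(node["left"], path + ((feature, "<=", threshold),))
    yield from extract_rules(node["right"], path + ((feature, ">", threshold),))


def spans_two_dimensions(conditions):
    """True if the conditions involve features of at least two dimensions."""
    return len({DIMENSION[feature] for feature, _, _ in conditions}) >= 2


# Candidate risk rules: paths that mix at least two feature dimensions.
rules = [(c, v) for c, v in extract_rules(TREE) if spans_two_dimensions(c)]
```

In this toy tree, the two paths that test only `bnp` are discarded because they involve a single dimension; the remaining paths become candidate risk rules for pruning.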

In some embodiments, said performing rule pruning on said plurality of risk rules comprises:

traversing the plurality of risk rules, and determining the number of patients and the number of deaths of the patients corresponding to each risk rule according to the medical data of the plurality of samples;

determining, based on the number of patients and the number of deaths corresponding to each risk rule, the survival proportion within a first preset time of the patients corresponding to that risk rule, the medical effectiveness of the risk rule, and the importance of the risk rule;

pruning the risk rules determined according to the tree-based model based on the survival proportion of the patients within the first preset time, the medical effectiveness, and the importance of the risk rule.

In some embodiments, the importance of the risk rule is related to the proportion of patients who meet the risk rule.
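
The per-rule statistics described above can be sketched as follows. This is only an illustration under stated assumptions: the sample fields (`bnp`, `age`, `died`), the rule encoding, the thresholds `min_importance` and `max_survival`, and the keep/drop criterion are hypothetical. Medical effectiveness is not defined in this passage, so the sketch takes it as a caller-supplied predicate `is_effective` rather than computing it.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    n_patients: int    # patients satisfying the rule
    n_deaths: int      # of those, deaths within the first preset time
    survival: float    # survival proportion within the first preset time
    importance: float  # share of all patients that satisfy the rule


def matches(sample, conditions):
    """True if the sample satisfies every (feature, op, threshold) condition."""
    return all(sample[f] > t if op == ">" else sample[f] <= t
               for f, op, t in conditions)


def rule_stats(conditions, samples):
    hit = [s for s in samples if matches(s, conditions)]
    n, deaths = len(hit), sum(s["died"] for s in hit)
    survival = (n - deaths) / n if n else float("nan")
    return RuleStats(n, deaths, survival, n / len(samples))


def prune(rules, samples, is_effective, min_importance=0.2, max_survival=0.5):
    """Keep rules that cover enough patients, show low survival and are
    judged medically effective (assumed criterion, for illustration only)."""
    kept = {}
    for name, conditions in rules.items():
        st = rule_stats(conditions, samples)
        if (st.importance >= min_importance and st.survival <= max_survival
                and is_effective(name)):
            kept[name] = st
    return kept


# Hypothetical cohort: (bnp, age, died within the first preset time).
SAMPLES = [dict(bnp=b, age=a, died=d) for b, a, d in [
    (500, 75, 1), (600, 80, 1), (450, 72, 0), (300, 60, 0), (200, 65, 0),
    (100, 50, 0), (700, 55, 0), (150, 78, 0), (90, 40, 0), (800, 85, 1)]]
RULES = {"A": [("bnp", ">", 400), ("age", ">", 70)],
         "B": [("bnp", ">", 400), ("age", "<=", 70)]}

kept = prune(RULES, SAMPLES, is_effective=lambda name: True)
```

Here rule A covers 4 of 10 patients with 3 deaths and is kept, while rule B covers a single patient and is pruned for low importance.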

In some embodiments, the method further comprises:

performing medical knowledge mining analysis on each risk rule based on medical parameters to obtain the relationship between characteristics and disease prognosis risk;

the medical parameters include at least: survival proportion of the patient within a first preset time, medical effectiveness, importance of risk rules and coefficients of the risk rules.

In some embodiments, said performing a medical knowledge mining analysis on each of said risk rules based on medical parameters, deriving a relationship between a feature and a disease prognosis risk comprises:

if the medical parameter meets a first condition, determining that the disease prognosis risk corresponding to the feature included in the risk rule corresponding to the medical parameter is a high risk; otherwise, it is determined to be at low risk.

In some embodiments, the first condition comprises:

the survival rate of the patient in the first preset time is smaller than a first threshold, the medical effectiveness is smaller than a second threshold, the importance of the risk rule is larger than a third threshold, and the coefficient of the risk rule is larger than a fourth threshold.
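
The first condition is a conjunction of four threshold comparisons, which can be written down directly. In the sketch below, the `Thresholds` container, the function name and all numeric values are hypothetical; the application does not give concrete threshold values in this passage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    survival: float       # first threshold
    effectiveness: float  # second threshold
    importance: float     # third threshold
    coefficient: float    # fourth threshold


def risk_level(survival, effectiveness, importance, coefficient, th):
    """Return "high" only when all four parts of the first condition hold."""
    high = (survival < th.survival
            and effectiveness < th.effectiveness
            and importance > th.importance
            and coefficient > th.coefficient)
    return "high" if high else "low"


# Illustrative threshold values only (assumed).
TH = Thresholds(survival=0.5, effectiveness=0.05, importance=0.2, coefficient=0.0)
level = risk_level(survival=0.25, effectiveness=0.01,
                   importance=0.4, coefficient=0.8, th=TH)
```

Because the condition is a conjunction, failing any single comparison is enough for the features of the rule to be classified as low risk.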

In some embodiments, the model training according to the medical data and the pruned risk rules of each sample comprises:

inputting the plurality of risk rules after pruning and the medical data into the disease prognosis risk prediction model to obtain a disease prognosis risk prediction result output by the disease prognosis risk prediction model;

determining a difference between the disease prognosis risk prediction outcome and the risk marker;

adjusting a parameter of the disease prognosis risk prediction model based on the difference.
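
As a minimal sketch of this training loop, assume the prediction model is a logistic regression whose inputs are the raw features plus one binary indicator per pruned rule. The model family, the feature scaling, the field names and the hyperparameters are assumptions made for illustration; the passage itself only requires that parameters be adjusted based on the difference between the prediction and the risk marker.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rule_indicator(sample, conditions):
    """1.0 if the sample satisfies every condition of the rule, else 0.0."""
    ok = all(sample[f] > t if op == ">" else sample[f] <= t
             for f, op, t in conditions)
    return 1.0 if ok else 0.0


def predict(w, b, x):
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def log_loss(w, b, X, y):
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, b, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)


def train(X, y, lr=0.5, epochs=200):
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            diff = predict(w, b, xi) - yi  # prediction minus risk marker
            grad_b += diff
            for j, xj in enumerate(xi):
                grad_w[j] += diff * xj
        # Adjust the parameters against the averaged gradient of the loss.
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical pruned rule and samples; features are scaled to [0, 1].
RULE = [("bnp", ">", 0.5), ("age", ">", 0.7)]
SAMPLES = [dict(age=a, bnp=p, risk=r) for a, p, r in [
    (0.9, 0.8, 1), (0.8, 0.9, 1), (0.75, 0.6, 1), (0.3, 0.9, 0),
    (0.9, 0.2, 0), (0.4, 0.3, 0), (0.5, 0.4, 0), (0.6, 0.7, 0)]]
X = [[s["age"], s["bnp"], rule_indicator(s, RULE)] for s in SAMPLES]
y = [s["risk"] for s in SAMPLES]
w, b = train(X, y)
```

In this toy data the risk marker depends on the combination of the two features, so the rule indicator (the last weight in `w`) gives the linear model access to an interaction it could not express from the raw features alone.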

In a second aspect, embodiments of the present application further provide a method for predicting disease prognosis risk, the method comprising:

acquiring medical data of a first patient, the medical data comprising a plurality of features of different dimensions;

inputting the medical data of the first patient into a trained disease prognosis risk prediction model, and determining a disease prognosis risk of the first patient based on an output of the disease prognosis risk prediction model;

the disease prognosis risk prediction model is obtained by training based on the disease prognosis risk prediction model training method provided by the embodiment of the application.

In a third aspect, an embodiment of the present application provides a disease prognosis risk prediction model training device, including:

the training set building module is used for acquiring medical data of a plurality of samples, and the medical data of each sample comprises a plurality of characteristics with different dimensions and risk markers;

a risk rule determining module for training the tree-based model at the granularity of the features of each sample; determining a plurality of risk rules according to paths of the tree-based model, each risk rule comprising features of at least two different dimensions; and performing rule pruning on the plurality of risk rules;

and the prediction model training module is used for carrying out model training according to the medical data of each sample and the plurality of risk rules after pruning to obtain a disease prognosis risk prediction model so as to predict the medical data of the patient through the disease prognosis risk prediction model.

In a fourth aspect, embodiments of the present application provide a disease prognosis risk prediction apparatus, including:

an acquisition module to acquire medical data of a first patient, the medical data including a plurality of features of different dimensions;

a prediction module for inputting medical data of the first patient to a trained disease prognosis risk prediction model, determining a disease prognosis risk of the first patient based on an output of the disease prognosis risk prediction model;

In a fifth aspect, an embodiment of the present application provides an electronic device, including:

a memory for storing executable instructions;

and a processor for implementing, when executing the executable instructions stored in the memory, the disease prognosis risk prediction model training method provided by the embodiments of the present application or the disease prognosis risk prediction method provided by the embodiments of the present application.

In a sixth aspect, embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the disease prognosis risk prediction model training method provided by the embodiments of the present application or the disease prognosis risk prediction method provided by the embodiments of the present application.

According to the disease prognosis risk prediction model training method provided by the embodiments of the present application, medical data of a plurality of samples are obtained, the medical data of each sample comprising a plurality of features of different dimensions and a risk marker; a plurality of risk rules are determined according to paths of a preset tree-based model, each risk rule comprising features of at least two different dimensions; rule pruning is performed on the plurality of risk rules; model training is performed according to the medical data of each sample and the pruned risk rules to obtain a disease prognosis risk prediction model; and the disease prognosis risk of a patient is predicted based on the trained disease prognosis risk prediction model. Because each risk rule comprises a plurality of features, the disease prognosis risk prediction model can predict disease prognosis risk based on the nonlinear relationships among the features, which improves the accuracy of disease prognosis risk prediction. Moreover, the embodiments of the present application can also perform medical knowledge mining on each risk rule based on medical parameters to obtain the relationship between features and disease prognosis risk, thereby improving the accuracy of medical knowledge mining and providing supporting evidence for unverified hypotheses.

Drawings

FIG. 1 is a schematic diagram of an architecture of a disease prognosis risk prediction model training system provided in an embodiment of the present application;

FIG. 2 is a schematic architecture diagram of a terminal device provided in an embodiment of the present application;

FIG. 3 is a schematic flow chart of a disease prognosis risk prediction model training method provided in an embodiment of the present application;

FIG. 4 is a schematic view of an alternative process flow for training a disease prognosis risk prediction model according to an embodiment of the present application;

FIG. 5 is a performance diagram of three models provided by embodiments of the present application;

FIG. 6 is a schematic diagram illustrating an alternative process flow for rule pruning of the risk rules determined according to the tree-based model according to an embodiment of the present application;

FIG. 7 is a schematic diagram of risk rules obtained after performing risk rule pruning based on statistical data according to the embodiment of the present application;

FIG. 8 is a schematic diagram of a disease prognosis risk prediction model provided in the embodiments of the present application.

Detailed Description

In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

In the following description, the terms "first", "second" and "third" are used only to distinguish similar objects and do not denote a particular order. It is understood that "first", "second" and "third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein. In the following description, the term "plurality" means at least two.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.

1) A training set, a collection of samples (also referred to as training samples) used to train a machine learning model in a supervised manner.

The samples in the training set include features of the samples (e.g., features of multiple dimensions) and explicitly valued target variables, so that the machine learning model can discover rules between predicting the target variables from the features of the samples, thereby having the performance of predicting the values of the target variables based on the features of the samples.

2) Gradient boosting, or the Gradient Boosting Decision Tree (GBDT) method, is a method for iteratively training a strong classifier (a function whose classification performance is sufficient to classify samples on its own) formed as a linear combination of a plurality of weak classifiers (functions whose classification performance is insufficient to classify samples on their own). After each training iteration, the model is updated by adding a function to the trained model according to the negative gradient direction of the loss function of the model, so that the prediction loss of the model decreases after each training iteration.
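
The additive update can be sketched in a few lines. The example below is a simplification rather than the method of the application: it uses squared loss on one-dimensional inputs, single-split regression stumps as the weak learners, and a hypothetical learning rate. For squared loss the negative gradient is simply the residual, so each round fits a stump to the current residuals and adds it to the model.

```python
def fit_stump(xs, residuals):
    """Fit the best single-threshold regressor (a weak learner) to residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, rounds=20, lr=0.5):
    """Return a predictor built as base value + lr * sum of fitted stumps."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + lr * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - f(x))^2 with respect to f(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (assumed): a step-like relationship no single constant can fit.
XS = [1, 2, 3, 4, 5, 6]
YS = [1.0, 1.0, 2.0, 5.0, 6.0, 6.0]
model = gradient_boost(XS, YS)
```

Each stump is weak on its own, but the sum of many stumps fitted to successive residuals approximates the target far better than the constant baseline.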

3) Extreme Gradient Boosting (XGBoost) is a C++ implementation of the gradient boosting decision tree method. It supports parallel model training using multiple threads on processors such as a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU), and also includes algorithmic improvements that increase accuracy.

4) Gradient descent is a method for finding the minimum of a loss function by moving in the direction in which the gradient descends. Variants include mini-batch gradient descent, batch gradient descent (BGD), stochastic gradient descent, and the like.
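
The three variants differ only in how many samples contribute to each update. The sketch below illustrates this on a one-parameter least-squares problem; the data, learning rate and function name are assumptions made for illustration.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by minimizing squared error.

    batch_size == len(xs) gives batch gradient descent, batch_size == 1
    gives stochastic gradient descent, and values in between give
    mini-batch gradient descent.
    """
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean of (w * x - y)^2 over the batch.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step against the gradient
    return w


# Toy data (assumed) generated by y = 2x, so the minimizer is w = 2.
XS = [1.0, 2.0, 3.0, 4.0]
YS = [2.0, 4.0, 6.0, 8.0]
w_sgd = gradient_descent(XS, YS, batch_size=1)
w_mini = gradient_descent(XS, YS, batch_size=2)
w_batch = gradient_descent(XS, YS, batch_size=4)
```

On this noise-free data all three variants approach the same minimizer; on real data, smaller batches trade noisier updates for cheaper steps.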

The embodiment of the application provides a disease prognosis risk prediction model training method and device, electronic equipment and a computer readable storage medium, which can be used for training a disease prognosis risk prediction model based on a nonlinear relation among characteristics and improving the accuracy of mining medical knowledge. An exemplary application of the electronic device provided in the embodiment of the present application is described below, and the electronic device provided in the embodiment of the present application may be implemented as various types of terminal devices, and may also be implemented as a server.

Referring to fig. 1, fig. 1 is an architecture diagram of a disease prognosis risk prediction model training system 100 provided in an embodiment of the present application, a terminal device 400 is connected to a server 200 through a network 300, and the server 200 is connected to a database 500, where the network 300 may be a wide area network or a local area network, or a combination of the two.

In some embodiments, taking the electronic device as a terminal device as an example, the disease prognosis risk prediction model training method provided in the embodiments of the present application may be implemented by the terminal device. For example, the terminal device 400 runs a client 410, and the client 410 may be a client for performing disease prognosis risk prediction model training.

In some embodiments, taking the electronic device as a server as an example, the disease prognosis risk prediction model training method provided in the embodiments of the present application may be cooperatively implemented by the server and the terminal device. For example, the server 200 obtains the health record of the patient from the database 500, constructs a training set based on the health record of the patient, and trains the tree-based model with the feature of each sample in the training set as the granularity; determining risk rules according to paths of the tree-based model, each risk rule comprising features of at least two dimensions; training a disease prognosis risk prediction model based on the risk rules and the characteristics of each sample in the training set, the disease prognosis risk prediction model having an attribute of predicting the disease prognosis risk of the patient. The server 200 sends the trained disease prognosis risk prediction model to the client 410.

In some embodiments, the terminal device 400 or the server 200 may implement the disease prognosis risk prediction model training method provided by the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; it may be a native application (APP), i.e. a program that needs to be installed in an operating system to run; it may be an applet, i.e. a program that only needs to be downloaded into a browser environment to run; or it may be an applet that can be embedded into any APP. In general, the computer program described above may be any form of application, module or plug-in.

In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a Cloud server providing basic Cloud computing services such as a Cloud service, a Cloud database, Cloud computing, a Cloud function, Cloud storage, a web service, Cloud communication, a middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, where Cloud Technology (Cloud Technology) refers to a hosting Technology for unifying resources of hardware, software, a network, and the like in a wide area network or a local area network to implement computing, storage, processing, and sharing of data. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the embodiment of the present application is not limited.

Taking the case where the electronic device provided in the embodiment of the present application is a terminal device as an example, it can be understood that, where the electronic device is a server, parts of the structure shown in FIG. 2 (such as the user interface, the presentation module, and the input processing module) may be omitted. Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a terminal device 400 provided in an embodiment of the present application. The terminal device 400 shown in FIG. 2 includes: at least one processor 460, a memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal device 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are all labeled as bus system 440 in FIG. 2.

The Processor 460 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.

The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 460.

The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.

In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 452 for communicating with other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;

a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with user interface 430;

an input processing module 454 for medical semantic mining of one or more user inputs or interactions from one of the one or more input devices 432.

In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software. FIG. 2 illustrates a disease prognosis risk prediction model training apparatus 455 stored in the memory 450, which may be software in the form of programs, plug-ins and the like, and which includes the following software modules: a training set construction module 4551, a tree-based model training module 4552, a risk rule determination module 4553 and a prediction model training module 4554. These modules are logical, and may therefore be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.

Prior to the detailed description of the embodiments of the present application, related technologies related to the embodiments of the present application will be briefly described.

The incidence of heart failure increases with population aging, changes in the disease spectrum, and the improved survival of patients with various cardiovascular diseases. The prevalence of heart failure in developed countries is 1.5%-2.0%, and the prevalence among people over 70 years old exceeds 10%. In China, population aging is intensifying, and with the rising incidence of chronic diseases such as coronary heart disease, hypertension, diabetes and obesity, patients with heart failure develop multiple complications and the clinical situation is complex. A poor prognosis of heart failure is therefore difficult to avoid. The 1-year all-cause mortality and 1-year hospitalization rates of patients with chronic stable heart failure were 7.2% and 31.9%, respectively, and those of patients with acute heart failure were 17.4% and 43.9%, respectively.

Several individual biomarkers may be used to predict the prognosis of a heart failure patient, and well known biomarkers may include Brain Natriuretic Peptide (BNP), age, cystatin C, serum uric acid, D-dimer, and the like. Traditional biomarkers of cardiovascular mortality in the general population, such as Body Mass Index (BMI), serum cholesterol and Blood Pressure (BP), were found to correlate with better outcomes in patients with Chronic Heart Failure (CHF). However, due to the complex prognosis of heart failure, the analysis of multiple biomarkers may be more valuable than the analysis of a single biomarker. Machine learning-based Interpretable Predictive Models (IPM) have advantages in delineating interactions between biomarkers, and multiple biomarkers have higher accuracy than a single biomarker for the prognosis of heart failure patients.

Medical knowledge mining aims at extracting meaningful patterns from medical data sets that are expected to provide support for doctors and patients during screening, diagnosis, treatment, prognosis, monitoring management. One popular data source for medical knowledge mining is an Electronic Health Record (EHR), which records patient daily activities at a hospital, such as demographic data, diagnoses, laboratory test results, care records, prescriptions, and the like. Compared to the general application of data mining, research in medical knowledge mining presents some specific difficulties, such as data availability and data standardization.

Cancer, heart disease and diabetes are the three most common diseases, and most of these studies focus on the diagnostic and prognostic stages. Specifically, social network analysis, text mining, time series analysis, and high-order feature construction can be applied to medical data analysis. Biomarkers such as gender and blood pressure can also be used to estimate the likelihood of a patient suffering from heart disease, and recurrent neural networks have achieved higher accuracy in mortality prediction. In the prior art, most cardiology research is devoted to classification techniques, and predictive modeling is a popular technique. Among specific machine learning algorithms, decision-tree-based algorithms have high accuracy.

The disease prognosis risk prediction model training method provided by the embodiment of the present application will be described below with reference to exemplary applications and implementations of the electronic device provided by the embodiment of the present application.

Referring to FIG. 3, FIG. 3 is a schematic flow chart of an optional disease prognosis risk prediction model training method provided in the embodiment of the present application, which will be described with reference to the steps shown in FIG. 3. The disease prognosis risk prediction model training method provided in the embodiment of the present application can be applied to various diseases such as heart failure, diabetes and renal cyst; the type of disease to which the disease prognosis risk prediction model is applied is not limited. The present application takes heart failure only as an example.

Step S101, a training set is constructed based on the health record of the patient.

In some embodiments, the health records of hospitalized patients over a period of time may be selected, with each health record being the medical data of one sample. The health record may include basic information of the patient (such as sex, age, marital status, place of birth, etc.), major diseases (such as past medical history, surgical history, allergy history, etc.), a summary of health problems, major health service records (such as medication and treatment), etc. The information included in the health record may be used as features for training a disease prognosis risk prediction model.

In some embodiments, to make the disease prognosis risk prediction model more accurate, health records of patients of different age groups and different sexes can be obtained to enrich the variety of samples in the training set.

In some embodiments, the data in the health record of the patient may be structured and standardized using natural language processing tools.

Step S102, training the tree-based model by taking the features of each sample in the training set as granularity.

In some embodiments, samples in which the filling rate of the feature variables is greater than a preset threshold may be selected for training. If a parameter included in the health record has a corresponding parameter value, the parameter is regarded as filled; if the parameter does not have a corresponding parameter value (a missing value), the parameter is regarded as not filled. The preset threshold can be set flexibly according to actual conditions, for example to 80% or another value. The tree-based model in the present application may be an initial model constructed before training, that is, a model to be trained; the tree-based model may be a classification tree model.
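The fill-rate selection described above can be sketched as follows. This is a minimal illustration, not the implementation of the present application: the record layout (a dictionary with `None` for a missing value), the field names, and the helper names `fill_rate` and `select_samples` are all assumptions made for the example.

```python
from typing import Dict, List, Optional

# Hypothetical record layout: parameter name -> value, None when not filled.
Sample = Dict[str, Optional[float]]


def fill_rate(sample: Sample) -> float:
    """Fraction of parameters in a health record that have a value."""
    if not sample:
        return 0.0
    filled = sum(1 for value in sample.values() if value is not None)
    return filled / len(sample)


def select_samples(samples: List[Sample], threshold: float = 0.8) -> List[Sample]:
    """Keep samples whose fill rate is greater than the preset threshold."""
    return [s for s in samples if fill_rate(s) > threshold]


# Illustrative records only; the values are made up.
records: List[Sample] = [
    {"age": 67, "wbc": 5.1, "baso": 0.03, "hgb": 120, "bil": 15.0},
    {"age": 72, "wbc": None, "baso": None, "hgb": 98, "bil": 30.2},
]
selected = select_samples(records, threshold=0.8)
```

With the 80% threshold mentioned above, the first record (all five parameters filled) is kept and the second (three of five filled) is dropped.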

In some embodiments, the characteristics of each sample may include at least: patient basic information (such as age and gender), lifestyle habits (such as smoking and drinking), past medical history (such as complications and surgery), etiology, vital signs, routine laboratory examinations, interventions, and hospitalizations. By way of example, the number of features of each sample may be 73, or another number, and the number of features of the samples is not limited in the embodiments of the present application.

In some embodiments, the sample features may be grouped according to the attributes of the feature values. In the present embodiment, some of the continuous features are divided into three groups: a low group, a normal group, and a high group. A continuous feature is a feature whose value can be any value within an interval, without interruption. For example, the normal range of white blood cells (WBC) is [3.5, 9.5] ×10⁹/L; if WBC < 3.5, the white blood cell feature is assigned to the "low" group. Other continuous features are divided into two groups, a normal group and a high group. For example, the normal range of basophils (BASO) is (0, 0.06] ×10⁹/L, so BASO is divided into two categories, normal and high.

Each sample feature may be represented as a one-hot feature; examples are shown in Table 1. After normalization, missing values are filled with the mean (for continuous features such as age) or the mode (for one-hot features).

TABLE 1

Feature	One-hot	Normal range (×10⁹/L)	Record
WBC low	(1,0,0)	[3.5, 9.5]	WBC < 3.5
WBC normal	(0,1,0)	[3.5, 9.5]	3.5 ≤ WBC ≤ 9.5
WBC high	(0,0,1)	[3.5, 9.5]	WBC > 9.5
BASO normal	(1,0)	(0, 0.06]	BASO ≤ 0.06
BASO high	(0,1)	(0, 0.06]	BASO > 0.06
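The grouping in Table 1 can be written as two small encoding functions. This is a sketch under the ranges stated in the table; the function names `encode_wbc` and `encode_baso` are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def encode_wbc(wbc: float) -> Tuple[int, int, int]:
    """One-hot encode WBC as (low, normal, high) using the range [3.5, 9.5]."""
    if wbc < 3.5:
        return (1, 0, 0)
    if wbc <= 9.5:
        return (0, 1, 0)
    return (0, 0, 1)


def encode_baso(baso: float) -> Tuple[int, int]:
    """One-hot encode BASO as (normal, high) using the range (0, 0.06]."""
    return (1, 0) if baso <= 0.06 else (0, 1)
```

Both boundaries of the WBC range fall in the normal group, matching the record column 3.5 ≤ WBC ≤ 9.5.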

In some embodiments, each sample is given a risk label according to the risk characteristic parameter in the sample to determine the risk type of the sample. The risk characteristic parameter can be a specific disease development trend or a feature value corresponding to a disease characteristic, and the risk types comprise high risk, medium risk, and low risk. Optionally, the samples may be risk-labeled according to the death characteristic (whether death occurs) and the healing characteristic (whether healing occurs) of each sample in the training set, that is, the death characteristic or the healing characteristic is used as the risk characteristic parameter. As an example, a sample is labeled as high risk if it has a clear death record within one year, as medium risk if it has a recurrence within one year, and as low risk if it has neither a recurrence nor a death record within one year. A high risk may indicate a high prognostic mortality of the patient, and a low risk may indicate a low prognostic mortality.
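The labeling example above can be sketched as a single function. The function name `label_risk` and its boolean arguments are hypothetical; the sketch also assumes that a death record takes precedence when a sample has both a recurrence and a death record, which the text does not state explicitly.

```python
def label_risk(died_within_year: bool, recurred_within_year: bool) -> str:
    """Assign a risk label from one-year death and recurrence records."""
    if died_within_year:
        # Assumption: death takes precedence over recurrence.
        return "high"
    if recurred_within_year:
        return "medium"
    return "low"
```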

In some embodiments, the sample features are input into the tree-based model, and the predicted risk type corresponding to the sample is determined from the model output value. The predicted risk type is the output obtained from the tree-based model during training. For example, if the output value of the tree-based model is between 0.7 and 1, the sample is determined to be of the high risk type; if the output value is between 0.4 and 0.7, the sample is determined to be of the medium risk type; and if the output value is between 0 and 0.4, the sample is determined to be of the low risk type. After the predicted risk type is determined, it is compared with the risk label of the sample, and the parameters of the tree-based model are adjusted according to the difference to train the tree-based model. As an example, the tree-based model may be implemented with a gradient boosting based algorithm.
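The mapping from model output to risk type can be sketched as below. The text does not say which type the boundary values 0.4 and 0.7 belong to, so this sketch assumes they belong to the higher type; the function name `risk_type` is hypothetical.

```python
def risk_type(score: float) -> str:
    """Map a model output in [0, 1] to a risk type.

    Assumption: boundary values 0.7 and 0.4 are assigned to the higher type.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score >= 0.7:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"
```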

In some embodiments, when the tree-based model is trained, a gradient boosting classifier can be fitted and its parameters adjusted according to the prediction results of the classifier, with the learning rate set to 0.1. A plurality of weak classifiers may be trained, for example 100 weak classifiers; the depth of each classifier may be set to 3, the number of features considered when deciding each split set to 7, and the minimum number of samples required to split a node set to 700. It should be noted that these parameters may be set according to experience.

Step S103, determining a plurality of risk rules according to the paths of the tree-based model, wherein each risk rule comprises features of at least two different dimensions.

Specifically, the tree-based model includes a plurality of decision trees, and each decision tree has a plurality of paths formed by the root node and child nodes. To study the influence of each feature in the sample feature set on disease risk, the present application establishes risk rules according to the paths from the root node through the child nodes to the leaf nodes, thereby capturing the influence of the features at different nodes on the risk prediction model. A risk rule may refer to a rule that has a significant influence on the mortality of patients with a certain disease, and a risk rule may include a plurality of features corresponding to the medical data.

In some embodiments, risk rules may be determined from the tree-based model using the rule fitting (RuleFit) algorithm. In particular implementations, the risk rules may be determined based on the paths of the tree-based model. As an example, the tree-based model starts from the root node, tests a feature of an instance, and assigns the instance to a child node according to the test result, where each child node corresponds to a value of that feature; the instance is tested and assigned recursively in this way until it reaches a leaf node.

In some embodiments, any path from the root node in the tree-based model may be converted into a risk rule. For example, with the depth of the classifier set to 3, a risk rule can be expressed as:

IF x1 > A1 AND x2 = A2 AND x3 = A3 THEN 1 ELSE 0.

That is, if the feature value corresponding to node x1 is greater than the first preset threshold A1, the feature value corresponding to node x2 is equal to the second preset threshold A2, and the feature value corresponding to node x3 is equal to A3, the classification result corresponding to the risk rule is 1; otherwise, the classification result corresponding to the risk rule is 0.

Specifically, the risk rule type (i.e., the value corresponding to the risk rule) $r_m$ in the present application may be represented by 1 or 0, as shown in the following equation (1):

$$r_m(x) = \prod_{j=1}^{T_m} I(x_j \in s_{jm}) \qquad (1)$$

where $I(\cdot)$ is an indicator function that is set to 0 if the value of the $j$-th feature $x_j$ is not within the range $s_{jm}$ specified by the rule generated by the $m$-th tree, and to 1 otherwise; $T_m$ represents the number of features included in the $m$-th tree.

In some embodiments, an example of a patient risk rule is shown in equation (2) below:

$$r_{256}(x) = I(\text{digoxin is used}) \cdot I(\text{BIL is high}) \cdot I(\text{HGB is not low}) \qquad (2)$$

As an example, if the three conditions "digoxin is used, bilirubin (BIL) is high, and hemoglobin (HGB) is not low" are all met, risk rule 256 is set to 1 (high risk); otherwise it is set to 0 (low risk). Accordingly, if a patient satisfies this risk rule, that is, the patient takes the drug digoxin, has high BIL, and has HGB that is not low, the patient's prognostic probability of death is high.
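A risk rule of this kind is a product of 0/1 indicator values, so it equals 1 only when every condition holds. The sketch below shows this for the three conditions of risk rule 256; the patient record layout and the field names `digoxin`, `bil_group`, and `hgb_group` are assumptions made for illustration, as is the helper `make_rule`.

```python
from typing import Callable, Dict, List

Patient = Dict[str, object]
Condition = Callable[[Patient], bool]


def make_rule(conditions: List[Condition]) -> Callable[[Patient], int]:
    """Build a rule that returns the product of its 0/1 indicator values."""
    def rule(patient: Patient) -> int:
        value = 1
        for condition in conditions:
            value *= int(condition(patient))
        return value
    return rule


# Hypothetical encoding of the three conditions of risk rule 256.
rule_256 = make_rule([
    lambda p: p["digoxin"] is True,
    lambda p: p["bil_group"] == "high",
    lambda p: p["hgb_group"] != "low",
])
```

A patient on digoxin with high BIL and normal HGB yields 1; changing any one of the three conditions yields 0.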

In some embodiments of the present application, the path from the root node to each leaf node in the tree-based model may be summarized into a risk rule; each risk rule may include features in two or more dimensions, and thus the risk rules are constructed by establishing a non-linear relationship between the different features. The characteristics included in the risk rules can be characteristics included in the samples in the training set, such as basic information of the patient (such as sex, age, marital status, birth place and the like), main diseases (such as a past medical history, an operation history, an allergy history and the like), health problem summaries, and main health service records (such as medication conditions, treatment conditions and the like).
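Converting root-to-leaf paths into rules can be sketched on a toy tree. The nested-dictionary tree format, the feature names, and the helpers `extract_rules` and `satisfies` are assumptions for illustration only; a trained gradient boosting model would expose its trees through its own library-specific structures.

```python
from typing import Any, Dict, List, Mapping, Tuple

Condition = Tuple[str, str, float]   # (feature, operator, threshold)
RulePath = List[Condition]


def extract_rules(node: Dict[str, Any],
                  prefix: Tuple[Condition, ...] = ()) -> List[RulePath]:
    """Collect one rule per root-to-leaf path of a toy decision tree."""
    if "leaf" in node:
        return [list(prefix)] if prefix else []
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], prefix + ((feature, "<=", threshold),))
    right = extract_rules(node["right"], prefix + ((feature, ">", threshold),))
    return left + right


def satisfies(rule: RulePath, x: Mapping[str, float]) -> int:
    """Return 1 if the sample meets every condition of the rule, else 0."""
    for feature, op, threshold in rule:
        ok = x[feature] <= threshold if op == "<=" else x[feature] > threshold
        if not ok:
            return 0
    return 1


# Hypothetical tree: split on age, then on a 0/1 "PLT normal" indicator.
toy_tree = {
    "feature": "age", "threshold": 81.5,
    "left": {"leaf": 0.1},
    "right": {
        "feature": "plt_normal", "threshold": 0.5,
        "left": {"leaf": 0.3},
        "right": {"leaf": 0.8},
    },
}
rules = extract_rules(toy_tree)
# Keep only rules that involve at least two different features.
multi_feature_rules = [r for r in rules if len({f for f, _, _ in r}) >= 2]
```

The toy tree has three leaves and therefore three paths; the filter keeps the two paths that combine age with the platelet indicator, in line with the requirement that each risk rule cover at least two dimensions.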

Step S104, performing model training based on the risk rules obtained in the above manner and the sample features of each sample to obtain a disease prognosis risk prediction model, so that disease risk is predicted according to the disease prognosis risk prediction model.

Here, the prediction of the disease risk in the present application is only to predict the magnitude or trend of the characteristic value in the disease, and the prediction result can provide reference for the health assessment of the patient by the doctor, and has no direct influence on the disease diagnosis and treatment.

In some embodiments, all risk rules determined in step S103 are combined with the features of the samples in step S102 to construct a new training set, and the disease prognosis risk prediction model is trained by taking the features of each sample in the new training set as granularity, that is, by taking the risk rules and the features of each sample in step S102 as granularity. The disease prognosis risk prediction model can be a linear model.

In some embodiments, an alternative process for training a disease prognosis risk prediction model can be as shown in fig. 4, which comprises at least:

step S104a, inputting the risk rule and the characteristics of each sample in the training set into the disease prognosis risk prediction model, and obtaining the disease prognosis risk prediction result output by the disease prognosis risk prediction model.

In some embodiments, the risk rules may be used as features of the sample to train the disease prognosis risk prediction model. Because a risk rule includes two or more features, it can embody a nonlinear association between different features. Training the disease prognosis risk prediction model with the risk rules as granularity enables the model to predict disease prognosis risk based on the nonlinear relations among different features, which improves the accuracy of prognosis risk prediction.
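Using risk rules as additional features amounts to appending each rule's 0/1 value to the original feature vector before the linear model is fitted. The sketch below shows only this augmentation step; the feature order `[digoxin, bil_high, hgb_low]`, the two toy rules, and the helper name `augment_with_rules` are assumptions for illustration.

```python
from typing import Callable, List, Sequence

Rule = Callable[[Sequence[float]], int]


def augment_with_rules(features: Sequence[float], rules: List[Rule]) -> List[float]:
    """Append the 0/1 value of every risk rule to the original feature vector."""
    return [float(v) for v in features] + [float(rule(features)) for rule in rules]


# Hypothetical one-hot features in the order [digoxin, bil_high, hgb_low].
toy_rules: List[Rule] = [
    lambda x: int(x[0] == 1 and x[1] == 1 and x[2] == 0),
    lambda x: int(x[1] == 1 and x[2] == 1),
]
augmented = augment_with_rules([1, 1, 0], toy_rules)
```

Each appended column is a conjunction of original features, which is how a linear model fitted on the augmented vector can pick up interactions between features.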

Step S104b, determining the difference between the disease prognosis risk prediction result and the risk marker.

In some embodiments, the disease prognosis risk prediction result is the prediction output by the disease prognosis risk prediction model during training, and it may be the same as or different from the risk marker. If the prediction result is the same as the risk marker, the prediction accuracy of the disease prognosis risk prediction model is high. If the difference between the prediction result and the risk marker is large, the prediction accuracy of the model is low and its parameters need to be further adjusted. In this case, the difference between the disease prognosis risk prediction result and the risk marker may be calculated.

Step S104c, adjusting parameters of the disease prognosis risk prediction model based on the difference.

In some embodiments, the parameters of the disease prognosis risk prediction model are adjusted based on the difference between the disease prognosis risk prediction result and the risk marker, so that the accuracy with which the adjusted model predicts prognosis risk reaches an expected or preset value.

In some embodiments, ten-fold cross validation may be applied to the training set to adjust the parameters of the disease prognosis risk prediction model. As an example, the training set is divided into ten parts; in turn, nine parts are used as training data to train the disease prognosis risk prediction model, and the remaining part is used as test data to verify the accuracy of the disease prognosis risk prediction results output by the model trained on the nine parts. Each round of verification yields the accuracy (or error rate) of the prediction results, and the average of the accuracy (or error rate) over the 10 rounds is used as an estimate of the accuracy of the algorithm. Ten-fold cross validation may also be repeated multiple times (e.g., 10 times), with the average value taken as the estimate of the accuracy of the algorithm.
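The ten-fold procedure can be sketched with plain index bookkeeping. The helper names `k_fold_indices` and `cross_validate`, and the `evaluate` callback (which is assumed to train on the given training indices and return the accuracy on the test indices), are hypothetical. The sketch assigns samples to folds by index stride; a real pipeline would usually shuffle or stratify first.

```python
from typing import Callable, List, Tuple

Split = Tuple[List[int], List[int]]   # (training indices, test indices)


def k_fold_indices(n_samples: int, k: int = 10) -> List[Split]:
    """Split sample indices into k folds; each fold is the test set once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits: List[Split] = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train_idx, test_idx))
    return splits


def cross_validate(n_samples: int,
                   evaluate: Callable[[List[int], List[int]], float],
                   k: int = 10) -> float:
    """Average the score returned by `evaluate` over the k rounds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)
```

With 100 samples, each round trains on 90 indices and tests on the remaining 10, and every sample appears in exactly one test fold.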

In some embodiments of the present application, after the disease prognosis risk prediction model is trained, the model may be tested. As an example, true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts may be determined from the confusion matrix corresponding to the test set, and the sensitivity, specificity, and accuracy of the disease prognosis risk prediction model may be calculated from them. In some embodiments, the area under the receiver operating characteristic curve (AUC) may be selected as the performance evaluation indicator of the disease prognosis risk prediction model on the test set.
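The three metrics follow directly from the confusion matrix counts: sensitivity is TP / (TP + FN), specificity is TN / (TN + FP), and accuracy is (TP + TN) divided by the total. The sketch below uses a hypothetical helper `classification_metrics` and made-up counts; it assumes the test set contains at least one positive and one negative sample so that no denominator is zero.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, and accuracy from confusion counts.

    Assumes tp + fn > 0 and tn + fp > 0.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Illustrative counts only; they are not results from the present application.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```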

To further verify the advantages of training the prognostic risk model with the rule fitting model, three prediction models, namely a logistic regression model, a gradient boosting decision tree model, and the rule fitting model, were each fitted on the training set. The performance of the three models is shown in Table 2 below. The results of the logistic regression model and the gradient boosting decision tree model are similar, and the overall performance of the rule fitting model is superior to both. The area under the curve (AUC) values of the three models are at the same level, all 0.99. The accuracy of all three models exceeds 95%: the accuracy of the logistic regression model and the rule fitting model is 0.98, which is one percentage point higher than that of the gradient boosting decision tree model. As for sensitivity and specificity, the sensitivity of the rule fitting model is the highest, at 0.97; the specificity of the rule fitting model is the same as that of the logistic regression model, 0.99, while the specificity of the gradient boosting decision tree model is 0.97. It can be seen that the rule fitting model has the best overall performance parameters.

TABLE 2

FIG. 5 shows the performance of the three models, including the ROC curves of the three models, and the AUC values of the three models. Based on fig. 5, it can be determined that the rule-fitting model is the most suitable model for predicting 1-year hospitalization mortality of heart failure patients.

In some embodiments, before the training of the disease prognosis risk prediction model based on the risk rule and the characteristics of each sample in the training set, or after the training of the disease prognosis risk prediction model is completed, the method for training the disease prognosis risk prediction model may further include:

Step S105, performing rule pruning on the risk rules determined according to the tree-based model, so as to perform medical knowledge mining according to the pruned risk rules and determine feature data that influence the disease.

In some embodiments, the risk rules determined according to the tree-based model are numerous, and some of them may have interpretations that do not conform to medical logic or may have low interpretability. In practical applications, some risk rules may also contain noise or bias. Therefore, the rule pruning criteria can be determined according to actual conditions in the medical field, medical guidelines, and feature-level interpretability methods for the disease prognosis risk prediction model.

In some embodiments, an optional process flow of performing rule pruning on the risk rules determined according to the tree-based model may be as shown in fig. 6, and at least includes:

step S105a, traversing the risk rules determined according to the tree-based model, and calculating the number of patients and the number of deaths of the patients corresponding to each risk rule.

In some embodiments, the number of patients and the number of patient deaths for each risk rule determined according to the tree-based model are calculated. The risk rules are determined from the paths of the tree-based model (the paths from the root node to the child nodes).

Step S105b, determining, based on the number of patients and the number of patient deaths corresponding to each risk rule, the mortality ratio within a first preset time of the patients corresponding to each risk rule, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule.

In some embodiments, the first preset time may be flexibly set according to practical applications, such as set to 1 year or half year.

In some embodiments, the mortality ratio and medical effectiveness are the primary priorities in the risk rule pruning criteria; accordingly, risk rules that satisfy the mortality ratio and medical effectiveness criteria are more important. The mortality ratio is the percentage of deceased patients in the total number of patients. Medical effectiveness can be used to characterize the effectiveness of treatment for a patient, and refers to judging, according to existing medical concepts, whether the risk rule conforms to conventional medical logic. For example, if the risk rule is surgical treatment first and then chemotherapy, conventional medical logic is not met and the medical effectiveness is poor; if the risk rule is chemotherapy first and then surgical treatment, conventional medical logic is met and the medical effectiveness is good. In some embodiments, the importance of the risk rules may be determined using a gradient boosting tree model, and the coefficients of the risk rules may be determined based on logistic regression.

In some embodiments, the importance of the risk rule may be determined by the following equation (3):

$$I_k = |\hat{\alpha}_k| \cdot \sqrt{s_k (1 - s_k)} \qquad (3)$$

where $I_k$ is the importance of the risk rule; $\hat{\alpha}_k$ is the weight value of the disease prognosis risk prediction model, and the value of $\hat{\alpha}_k$ can be obtained by training on historical data; $s_k$ is the proportion of patients satisfying the risk rule. $s_k$ can be determined by the following equation (4):

$$s_k = \frac{1}{n} \sum_{i=1}^{n} r_k(x_i) \qquad (4)$$

where $n$ represents the number of patients and $r_k$ indicates the risk rule type, that is, whether a patient satisfies the risk rule: if the patient satisfies the risk rule, $r_k$ is 1; otherwise, $r_k$ is 0.
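The support and importance of a rule can be computed as in the sketch below. It assumes the standard RuleFit definitions, in which the support is the fraction of patients satisfying the rule and the importance is the absolute value of the rule's weight multiplied by the square root of support × (1 − support); the function names `rule_support` and `rule_importance` are hypothetical.

```python
import math
from typing import List


def rule_support(rule_values: List[int]) -> float:
    """Fraction of patients that satisfy the rule (values are 0 or 1)."""
    return sum(rule_values) / len(rule_values)


def rule_importance(coefficient: float, support: float) -> float:
    """RuleFit-style importance: |coefficient| * sqrt(support * (1 - support))."""
    return abs(coefficient) * math.sqrt(support * (1.0 - support))


# Illustrative values: the rule holds for 2 of 4 patients, weight is -2.0.
support = rule_support([1, 0, 0, 1])
importance = rule_importance(-2.0, support)
```

Under this definition a rule that applies to nearly all or nearly no patients gets a low importance even when its weight is large, because the square-root term shrinks toward zero.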

In some embodiments, the importance of the risk rules and the coefficients of the risk rules are secondary priorities in the risk rules pruning criteria. The risk rule pruning criteria may be as shown in table 3 below:

TABLE 3

Criterion	Importance	Primary priority
Mortality ratio	1	Yes
Medical effectiveness	1	Yes
Importance of the risk rule	2	No
Coefficient of the risk rule	3	No
Number of deaths	4	No

As an example, if a risk rule satisfies the mortality ratio and medical effectiveness criteria, the risk rule is considered to be of higher importance and needs to be retained. If the coefficient and the number of deaths of the risk rule are within a preset reasonable range, the risk rule is considered to be an interpretable risk rule and meaningful medical knowledge.

Step S105c, pruning the risk rules determined according to the tree-based model based on the death rate, the medical effectiveness, the importance of the risk rules and the coefficients of the risk rules within the first preset time of the patient.

Based on the risk rule pruning criteria shown in Table 3 above, if a risk rule satisfies the mortality ratio and medical effectiveness criteria, the risk rule is considered to be of higher importance and needs to be retained; otherwise, the risk rule needs to be deleted. If the coefficient and the number of deaths of the risk rule are within a preset reasonable range, the risk rule is considered to be an interpretable risk rule and meaningful medical knowledge, and needs to be retained; otherwise, the risk rule needs to be deleted. The risk rules are pruned by deleting part of them in this way.
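One possible reading of the pruning step is sketched below: a rule is kept only if it passes both the primary criteria (mortality ratio and medical effectiveness) and the range checks on coefficient and number of deaths. The `RuleStats` structure, the boolean `medically_valid` flag, the thresholds, and the choice to combine the checks with a logical AND are all assumptions made for illustration; the importance criterion from Table 3 is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RuleStats:
    name: str
    n_patients: int        # patients matching the rule
    n_deaths: int          # deaths among them within the preset time
    medically_valid: bool  # hypothetical flag from expert review
    coefficient: float     # rule coefficient in the linear model

    @property
    def mortality_ratio(self) -> float:
        return self.n_deaths / self.n_patients if self.n_patients else 0.0


def prune(rules: List[RuleStats], min_mortality: float,
          coef_range: Tuple[float, float],
          deaths_range: Tuple[int, int]) -> List[RuleStats]:
    """Keep rules that pass the primary criteria and both range checks."""
    kept = []
    for r in rules:
        primary = r.mortality_ratio > min_mortality and r.medically_valid
        secondary = (coef_range[0] <= r.coefficient <= coef_range[1]
                     and deaths_range[0] <= r.n_deaths <= deaths_range[1])
        if primary and secondary:
            kept.append(r)
    return kept


# Illustrative statistics only.
candidates = [
    RuleStats("A", 1000, 150, True, 0.4),
    RuleStats("B", 1000, 20, True, 0.3),
    RuleStats("C", 500, 100, False, 0.5),
]
kept = prune(candidates, min_mortality=0.10,
             coef_range=(0.0, 1.0), deaths_range=(50, 500))
```

Here rule B is dropped for a low mortality ratio and rule C for failing the medical validity check, leaving only rule A.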

In some embodiments, after training the disease prognosis risk prediction model, the disease prognosis risk prediction model training method may further include:

Step S106, performing medical knowledge mining on each risk rule based on medical parameters to obtain features related to the disease prognosis risk of the patient.

In some embodiments, the risk rules identified in the embodiments of the present application may be free of paradox rules, provided that there is no contradiction between medical guidance or knowledge recognized in the art and the risk rules. As an example, the risk rules determined by embodiments of the present application conform to the actual situation of a real-world patient.

In some embodiments, medical knowledge mining may be performed based on the mortality ratio within a first preset time of the patient, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule included in the medical parameters, to derive a feature related to the risk of prognosis of the disease of the patient; from the medical knowledge obtained it can be determined to which features the risk of prognosis of the disease of the patient is associated. In particular implementations, medical knowledge mining may be performed based on the mortality rate, the medical effectiveness, the importance of the risk rules, and the coefficients of the risk rules for a first preset time of a patient.

In some embodiments, if it is determined that the mortality ratio of the patients within the first preset time, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule included in the medical parameters satisfy a first condition, the disease prognosis risk corresponding to the features included in the risk rule corresponding to the medical parameters is high risk; as an example, high risk may refer to death of the patient. The first condition comprises: the mortality ratio of the patients within the first preset time is greater than a first threshold, the medical effectiveness is less than a second threshold, the importance of the risk rule is greater than a third threshold, and the coefficient of the risk rule is greater than a fourth threshold. The first, second, third, and fourth thresholds can be set flexibly according to the application scenario.

If it is determined that the mortality ratio of the patients within the first preset time, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule included in the medical parameters satisfy a second condition, the disease prognosis risk corresponding to the features included in the risk rule corresponding to the medical parameters is low risk; as an example, low risk may refer to the patient not dying. The second condition is that at least one of the following is not satisfied: the mortality ratio of the patients within the first preset time is greater than the first threshold, the medical effectiveness is less than the second threshold, the importance of the risk rule is greater than the third threshold, and the coefficient of the risk rule is greater than the fourth threshold. The first, second, third, and fourth thresholds can be set flexibly according to the application scenario.
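The first and second conditions can be sketched as a single check. The sketch assumes that medical effectiveness is available as a numeric score that can be compared with a threshold; the function name `prognosis_risk` and the threshold values in the usage line are hypothetical.

```python
from typing import Tuple


def prognosis_risk(mortality_ratio: float, medical_effectiveness: float,
                   importance: float, coefficient: float,
                   thresholds: Tuple[float, float, float, float]) -> str:
    """Return "high" when all four parts of the first condition hold, else "low"."""
    t1, t2, t3, t4 = thresholds
    first_condition = (mortality_ratio > t1
                       and medical_effectiveness < t2
                       and importance > t3
                       and coefficient > t4)
    # The second condition is the case where at least one part fails.
    return "high" if first_condition else "low"


# Illustrative thresholds and values only.
example = prognosis_risk(0.15, 0.3, 0.08, 0.5, thresholds=(0.10, 0.5, 0.058, 0.0))
```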

In an embodiment of the present application, statistics were computed over the samples in the test set. The average mortality ratio corresponding to the risk rules is 6.26%; the mortality ratio of 42 risk rules is higher than this average, and the mortality ratio of 26 risk rules is higher than 10%, so these 26 risk rules need to be studied further. In addition, 46 risk rules have an importance higher than the mean (0.058), 49 risk rules have a coefficient greater than 0, and 48 risk rules have a number of deaths higher than the average number of deaths (122).

In some embodiments, after risk rule pruning is performed based on the statistical data, the risk rules shown in fig. 7 are obtained. Some of these risk rules have already been validated, such as the risk rule with the second highest mortality ratio in fig. 7: "gender: female & hs-cTnI high: yes & UA normal: no". This risk rule may be interpreted as follows: for a female patient, the risk of death may be higher when hs-cTnI is high and uric acid (UA) is abnormal. hs-cTnI is a useful biomarker for patients with heart failure, and uric acid is an important prognostic marker for all-cause mortality in patients with heart failure; identifying this risk rule is therefore meaningful. However, since the medical semantics of this risk rule have already been proven, it offers little in the way of decision suggestions or inspiration for the physician.

In some embodiments, some risk rules not only conform to medical common sense but may also relate to actually observed results that are not yet recognized or understood in the art, which provides a real-world basis for medical research. As an example, the risk rule "monocyte percentage low: no & urea high: yes & CK-MB high: yes" may be interpreted as follows: if urea and creatine kinase MB (CK-MB) levels are high when a patient is diagnosed with heart failure, more attention should be paid to whether the monocyte percentage is also low. Although the prior art has suggested that urea predicts mortality in heart failure patients and that monocytes are involved in the pathogenesis of cardiovascular disease, the effect on mortality in heart failure patients remains unclear.

As an example, the risk rule "ALP high: yes & ChE low: yes & gender: male" may be interpreted as follows: for male heart failure (HF) patients, alkaline phosphatase (ALP) above the normal range together with cholinesterase (ChE) below the normal range may be a high-risk combination of factors affecting one-year mortality. In the prior art, no special attention has been paid to the biochemical features alkaline phosphatase and cholinesterase. The embodiment of the present application provides support for using these two biochemical features in assessing the prognostic risk of heart failure patients.

As an example, the risk rule "PLT normal: yes & age > 81.5: yes" may be interpreted as follows: for heart failure patients over 82 years of age, age is a high-risk factor, and it has been demonstrated that the mortality of patients over 82 years of age is high, which conforms to medical logic. The condition "platelets (PLT) in the normal range" may be a risk factor for death in elderly (>82) heart failure patients because of the "reverse epidemiology" or "risk factor paradox" theory, according to which some high-risk factors for heart failure, such as body mass index and blood pressure, are associated with better prognostic outcomes in congestive heart failure patients. Thus, the one-year mortality risk associated with platelets in elderly patients remains to be further validated.

In the embodiment of the present application, the disease prognosis risk prediction model can be as shown in fig. 8. A classifier of the gradient boosting framework is fitted first, and the training result φ_j of each of a plurality of regression tree models (classification and regression trees, CART) is multiplied by the corresponding weight value θ_j and integrated into the sample prediction result f(x). Each later regression tree model in the classifier learns the errors of the regression tree models before it, and the results of all regression tree models are accumulated to obtain the prognosis risk prediction result.
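The accumulation of weighted learner outputs described above can be sketched as a weighted sum. The learners here are hand-written toy functions rather than trained regression trees, and the feature names, weights, and the helper name `ensemble_predict` are assumptions for illustration.

```python
from typing import Callable, Dict, List

Learner = Callable[[Dict[str, float]], float]


def ensemble_predict(x: Dict[str, float], learners: List[Learner],
                     weights: List[float]) -> float:
    """Sum each learner's output multiplied by its weight."""
    if len(learners) != len(weights):
        raise ValueError("learners and weights must have the same length")
    return sum(theta * phi(x) for phi, theta in zip(learners, weights))


# Toy stand-ins for trained regression trees.
toy_learners: List[Learner] = [
    lambda x: 1.0 if x["age"] > 81.5 else 0.0,
    lambda x: 1.0 if x["urea_high"] == 1 else 0.0,
]
toy_weights = [0.5, 0.25]
score = ensemble_predict({"age": 85, "urea_high": 1}, toy_learners, toy_weights)
```

In an actual boosting procedure each later learner would be fitted to the errors left by the earlier ones; the sketch covers only the final accumulation step.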

After the risk prediction result is obtained, the parameters of the disease prognosis risk prediction model can be adjusted according to the risk prediction result by grid search, with the area under the curve (AUC) of the disease prognosis risk prediction model on the training set as the optimization target. First, at the level of the boosting framework, the model is optimized over different learning rates and numbers of weak classifiers; the learning rate can be set to 0.1 and the number of weak classifiers to 100. With the subsampling (subsample) method, subsample may be set to 0.8, that is, sampling without replacement, to prevent overfitting; for the loss function (loss), a log-likelihood loss function may be adopted. Secondly, at the weak classifier level, the factors considered include the maximum number of features considered when splitting a node (max_feature), the maximum depth of the decision tree (max_depth), the number of samples required to split an internal node (min_samples_split), the minimum number of samples of a leaf node (min_samples_leaf), and the maximum number of leaf nodes; max_feature may be set to 7, max_depth to 3 layers, and min_samples_split to 700. Finally, the risk rules obtained by the model are used as an additional feature set, combined with the original feature set, and input into a logistic regression linear model for training to obtain the disease prognosis risk prediction model; the function of the disease prognosis risk prediction model can be expressed as:

$$f(x) = \sum_{j=1}^{b} \theta_j \phi_j(x)$$

where $\phi_j(x)$ represents the training result of each learner, $x$ is the original feature set and the risk rules, $\theta_j$ denotes the weight value corresponding to the learner, and $b$ represents the number of learners.

Therefore, the disease prognosis risk prediction model training method and the medical knowledge mining method provided by the embodiments of the present application can verify the validity of the mined medical knowledge. The mined medical knowledge not only confirms known medical knowledge but also provides supporting evidence for unverified hypotheses and gives medical workers a basis for clinical diagnosis and treatment.

The embodiment of the application also provides a disease prognosis risk prediction method, which comprises the following steps:

obtaining a health profile of a first patient, wherein the health profile includes basic patient information (e.g., age and gender), lifestyle habits (e.g., smoking and drinking), past medical history (e.g., complications and surgery), etiology, vital signs, routine laboratory examinations, interventions, and hospitalizations. The health profile is the medical data corresponding to the patient.

Inputting the health profile of the first patient into the trained disease prognosis risk prediction model to obtain a prognosis risk prediction value, wherein the disease prognosis risk prediction model is obtained by the disease prognosis risk prediction model training method provided by the above embodiments.

The prognosis risk prediction value can be 0 or 1: a value of 1 indicates that, based on the patient's data features, the patient is determined to have a high disease risk or a high probability of death, whereas a value of 0 indicates a low disease risk or a low probability of death.

Determining a risk rule affecting the risk prediction value based on the disease prognosis risk prediction model.

Based on the corresponding feature data in the risk rule, a plurality of feature data affecting the disease is determined.
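The prediction steps above can be sketched in a few lines. This is only an illustration under assumptions: the two rules, their weights, the bias, the feature names, and the 0.5 decision threshold are all hypothetical, and for brevity the sketch scores only the rule indicators, whereas the model described in this application also takes the original features as input.

```python
import math

# Hypothetical risk rules: each rule maps a feature name to a test on its value.
RULES = {
    "age>=65 & smoker": {
        "age": lambda v: v >= 65,
        "smoking": lambda v: v == 1,
    },
    "sbp<90 & prior_surgery": {
        "systolic_bp": lambda v: v < 90,
        "surgery": lambda v: v == 1,
    },
}
# Hypothetical coefficients of a trained logistic model over the rule indicators.
RULE_WEIGHTS = {"age>=65 & smoker": 1.2, "sbp<90 & prior_surgery": 1.6}
BIAS = -1.0


def rule_fires(rule, profile):
    """A rule fires when every one of its feature tests holds for the profile."""
    return all(name in profile and test(profile[name]) for name, test in rule.items())


def predict(profile, threshold=0.5):
    """Return (risk label 0/1, rules that fired, features named in those rules)."""
    fired = [name for name, rule in RULES.items() if rule_fires(rule, profile)]
    logit = BIAS + sum(RULE_WEIGHTS[name] for name in fired)
    probability = 1.0 / (1.0 + math.exp(-logit))
    label = 1 if probability >= threshold else 0
    features = sorted({feature for name in fired for feature in RULES[name]})
    return label, fired, features
```

For example, `predict({"age": 70, "smoking": 1, "systolic_bp": 120, "surgery": 0})` fires only the first rule, so the features reported as affecting the prediction are `age` and `smoking`.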

The risk rule is determined according to a tree-based model, and the method further comprises: constructing a training set based on the health profiles of a plurality of second patient samples; taking the features of each sample in the training set as the input of the tree-based model, and determining the output of the tree-based model as a risk rule sample; calculating a difference between the risk rule sample and the true risk rule of the second patient sample; and adjusting a parameter of the tree-based model based on the difference.
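As an illustration of how risk rules can be read off the paths of a tree-based model, the sketch below walks a toy decision tree and turns each root-to-leaf path into a rule, keeping only rules that involve at least two different features. The tree structure, feature names, thresholds, and leaf labels are hypothetical and are not taken from the embodiment.

```python
# A toy decision tree (hypothetical). Internal nodes test "feature <= threshold";
# the left branch is taken when the test holds, the right branch otherwise.
TREE = {
    "feature": "age", "threshold": 65,
    "left": {"leaf": 0},
    "right": {
        "feature": "systolic_bp", "threshold": 90,
        "left": {"leaf": 1},
        "right": {"leaf": 0},
    },
}


def extract_rules(node, path=()):
    """Collect root-to-leaf paths as rules with at least two distinct features."""
    if "leaf" in node:
        distinct_features = {condition[0] for condition in path}
        if len(distinct_features) >= 2:
            return [{"conditions": list(path), "risk": node["leaf"]}]
        return []  # single-feature paths are not kept as risk rules
    feature, threshold = node["feature"], node["threshold"]
    return (
        extract_rules(node["left"], path + ((feature, "<=", threshold),))
        + extract_rules(node["right"], path + ((feature, ">", threshold),))
    )


rules = extract_rules(TREE)
```

Here the path through `age <= 65` is dropped because it involves a single feature, while the two paths that combine `age` with `systolic_bp` are kept.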

The disease prognosis risk of the first patient is determined based on the disease prognosis risk prediction model, and the method further comprises: inputting the risk rule sample and the feature sample corresponding to the health profile of the second patient sample into the disease prognosis risk prediction model to obtain a disease prognosis risk prediction result output by the model; determining a difference between the disease prognosis risk prediction result and the risk marker; and adjusting a parameter of the disease prognosis risk prediction model based on the difference.
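The training loop described above (predict, measure the difference from the risk marker, adjust the parameters) can be sketched with a plain logistic regression trained by batch gradient descent. This is a sketch under assumptions: the toy data, the learning rate, and the number of epochs are hypothetical, and each input row simply concatenates one original feature with one binary risk rule indicator.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, learning_rate=1.0, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-likelihood loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            prediction = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            difference = prediction - y  # difference between prediction and risk marker
            for j, v in enumerate(x):
                grad_w[j] += difference * v
            grad_b += difference
        # Adjust the parameters against the average gradient.
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Hypothetical samples: [scaled age (original feature), risk rule indicator].
ROWS = [[0.9, 1], [0.8, 1], [0.7, 0], [0.3, 0], [0.2, 0], [0.85, 1]]
RISK_MARKERS = [1, 1, 0, 0, 0, 1]

weights, bias = train_logistic(ROWS, RISK_MARKERS)
predicted = [
    1 if sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= 0.5 else 0
    for x in ROWS
]
```

Treating the rule indicator as just another input column is what allows the rules and the original features to be trained jointly in one linear model.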

The following continues with the exemplary structure of the disease prognosis risk prediction model training device 455 provided by the embodiments of the present application when implemented as software modules. In some embodiments, as shown in FIG. 2, the software modules of the disease prognosis risk prediction model training device 455 stored in the memory 450 may include: a training set construction module 4551, configured to construct a training set based on the health profiles of patients; a tree-based model training module 4552, configured to train the tree-based model with the features of each sample in the training set as the granularity; a risk rule determining module 4553, configured to determine risk rules from the paths of the tree-based model, each risk rule comprising features of at least two dimensions; and a prediction model training module 4554, configured to train a disease prognosis risk prediction model based on the risk rules and the features of each sample in the training set, the disease prognosis risk prediction model being capable of predicting the disease prognosis risk of a patient.

In some embodiments, the disease prognosis risk prediction model training device 455 further includes: a medical knowledge mining module 4555, configured to perform medical knowledge mining on each risk rule based on medical parameters to obtain the relationship between features and disease prognosis risk;

the medical parameters include at least: the mortality rate of the patient within a first preset time, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule.

In some embodiments, the medical knowledge mining module 4555 is configured to determine that the disease prognosis risk corresponding to the feature included in the risk rule corresponding to the medical parameter is a high risk if the medical parameter satisfies the first condition.

In some embodiments, the first condition comprises: the mortality rate of the patient within a first preset time is greater than a first threshold, the medical effectiveness is determined to be less than a second threshold, the importance of the risk rule is greater than a third threshold, and the coefficient of the risk rule is greater than a fourth threshold.

In some embodiments, the medical knowledge mining module 4555 is configured to determine that the disease prognosis risk corresponding to the features included in the risk rule corresponding to the medical parameters is low risk if the mortality rate of the patient within the first preset time, the medical effectiveness, the importance of the risk rule, and the coefficient of the risk rule included in the medical parameters satisfy the second condition.

In some embodiments, the second condition comprises: at least one of the following is not satisfied: the mortality rate of the patient within a first preset time is greater than a first threshold, the medical effectiveness is less than a second threshold, the importance of the risk rule is greater than a third threshold, and the coefficient of the risk rule is greater than a fourth threshold.
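The first and second conditions above amount to a conjunction of four threshold tests and its negation. A minimal sketch follows; the class names, field names, and any threshold values passed in are hypothetical, since the application does not give concrete thresholds in this section.

```python
from dataclasses import dataclass


@dataclass
class MedicalParameters:
    mortality_rate: float         # within the first preset time
    medical_effectiveness: float
    rule_importance: float
    rule_coefficient: float


@dataclass
class Thresholds:
    first: float    # lower bound on the mortality rate
    second: float   # upper bound on the medical effectiveness
    third: float    # lower bound on the rule importance
    fourth: float   # lower bound on the rule coefficient


def prognosis_risk(params, thresholds):
    """Return "high" when all four tests hold (first condition), else "low"."""
    first_condition = (
        params.mortality_rate > thresholds.first
        and params.medical_effectiveness < thresholds.second
        and params.rule_importance > thresholds.third
        and params.rule_coefficient > thresholds.fourth
    )
    return "high" if first_condition else "low"
```

Because the second condition is defined as the failure of at least one of the four tests, a single `else` branch covers it.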

In some embodiments, the tree-based model training module 4552 is configured to perform rule pruning on the risk rules determined according to the tree-based model to obtain the risk rules for training the disease prognosis risk prediction model.

In some embodiments, the tree-based model training module 4552 is configured to traverse the risk rules determined according to the tree-based model, and calculate the number of patients and the number of patient deaths corresponding to each risk rule;

determining the death proportion of the patients within a first preset time, the medical effectiveness, the importance of the risk rules, and the coefficients of the risk rules based on the number of patients and the number of patient deaths;

pruning the risk rules determined according to the tree-based model based on the death proportion of the patients within the first preset time, the medical effectiveness, the importance of the risk rules, and the coefficients of the risk rules.
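The traversal and pruning steps above can be sketched as follows. Several points are assumptions made for illustration: importance is taken to be the proportion of patients who satisfy the rule, the pruning criterion (minimum death proportion and minimum importance) and its cut-off values are hypothetical, and medical effectiveness and the rule coefficient are left out because this section does not define how they are computed.

```python
def rule_statistics(rule, patients):
    """Count the patients matching a rule and how many of them died."""
    matched = [p for p in patients if rule(p)]
    deaths = sum(p["died"] for p in matched)
    count = len(matched)
    return {
        "patients": count,
        "deaths": deaths,
        "death_proportion": deaths / count if count else 0.0,
        # Assumption: importance = share of all patients who satisfy the rule.
        "importance": count / len(patients) if patients else 0.0,
    }


def prune_rules(rules, patients, min_death_proportion=0.5, min_importance=0.2):
    """Keep only the rules whose statistics pass the (hypothetical) cut-offs."""
    kept = {}
    for name, rule in rules.items():
        stats = rule_statistics(rule, patients)
        if (stats["death_proportion"] >= min_death_proportion
                and stats["importance"] >= min_importance):
            kept[name] = stats
    return kept


# Hypothetical patient records; "died" marks death within the first preset time.
PATIENTS = [
    {"age": 70, "smoking": 1, "died": 1},
    {"age": 72, "smoking": 1, "died": 1},
    {"age": 68, "smoking": 0, "died": 0},
    {"age": 40, "smoking": 1, "died": 0},
    {"age": 35, "smoking": 0, "died": 0},
]
CANDIDATE_RULES = {
    "elderly_smoker": lambda p: p["age"] >= 65 and p["smoking"] == 1,
    "young_nonsmoker": lambda p: p["age"] < 65 and p["smoking"] == 0,
}

kept_rules = prune_rules(CANDIDATE_RULES, PATIENTS)
```

With these records only `elderly_smoker` survives pruning: it matches two of the five patients and both died, while `young_nonsmoker` matches one patient with no deaths.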

In some embodiments, the prediction model training module 4554 is configured to input the risk rule and the characteristics of each sample in the training set into the disease prognosis risk prediction model, and obtain a disease prognosis risk prediction result output by the disease prognosis risk prediction model;

determining a difference between the disease prognosis risk prediction result and a risk marker;

The embodiment of the present application further provides a disease prognosis risk prediction device, which includes:

wherein the disease prognosis risk prediction model is obtained by training based on any one of the methods provided in the examples of the application.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the disease prognosis risk prediction model training method described above in the embodiment of the present application; alternatively, the disease prognosis risk prediction method described above in the embodiments of the present application is performed.

Embodiments of the present application provide a computer-readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform a method provided by the embodiments of the present application, for example, the disease prognosis risk prediction model training method or the disease prognosis risk prediction method shown in FIGS. 3 to 8.

In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims

1. A method for training a disease prognosis risk prediction model, wherein the method comprises the following steps:

performing rule pruning on the plurality of risk rules;

2. The method of claim 1, wherein said pruning the plurality of risk rules comprises:

determining survival proportion, medical effectiveness and importance of risk rules of the patients corresponding to each risk rule within a first preset time based on the number of the patients corresponding to each risk rule and the number of the dead patients;

pruning the risk rules determined according to the tree-based model based on the survival proportion of the patient within a first preset time, the medical effectiveness and the importance of the risk rules.

3. The method of claim 2, wherein the importance of the risk rule is related to the proportion of patients who meet the risk rule.

4. The method of claim 1, further comprising:

5. The method of claim 4, wherein said performing a medical knowledge mining analysis on each of said risk rules based on medical parameters, deriving a feature-to-disease prognosis risk relationship comprises:

if the medical parameter meets a first condition, determining that the disease prognosis risk corresponding to the feature included in the risk rule corresponding to the medical parameter is high risk; otherwise, it is determined to be at low risk.

6. The method of claim 5, wherein the first condition comprises:

7. The method according to any one of claims 1 to 6, wherein the model training according to the medical data of each sample and the plurality of risk rules after pruning comprises:

8. A method for predicting disease prognosis risk, the method comprising:

wherein the disease prognosis risk prediction model is trained based on the method of any one of claims 1 to 7.

9. A disease prognosis risk prediction model training device, characterized by comprising:

a risk rule determining module for training a tree-based model with the features of each sample as granularity; determining a plurality of risk rules according to paths of the tree-based model, each risk rule comprising features of at least two different dimensions; and performing rule pruning on the plurality of risk rules;

10. A disease prognosis risk prediction apparatus, characterized by comprising:

11. An electronic device, comprising:

a memory for storing executable instructions;

a processor for implementing the disease prognosis risk prediction model training method of any one of claims 1 to 7 when executing the executable instructions stored in the memory;

or for implementing the disease prognosis risk prediction method of claim 8.

12. A computer-readable storage medium storing executable instructions for implementing the disease prognosis risk prediction model training method according to any one of claims 1 to 7 when executed by a processor;