CN112270451A - Monitoring and early warning method and system based on reinforcement learning - Google Patents

Monitoring and early warning method and system based on reinforcement learning

Info

Publication number
CN112270451A
Authority
CN
China
Prior art keywords
decision
action
environment
monitoring data
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011217940.8A
Other languages
Chinese (zh)
Other versions
CN112270451B (en)
Inventor
陈芋文
张矩
钟坤华
孙启龙
林小光
刘江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Institute of Green and Intelligent Technology of CAS
Original Assignee
Chongqing Institute of Green and Intelligent Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Institute of Green and Intelligent Technology of CAS filed Critical Chongqing Institute of Green and Intelligent Technology of CAS
Priority to CN202011217940.8A priority Critical patent/CN112270451B/en
Publication of CN112270451A publication Critical patent/CN112270451A/en
Application granted granted Critical
Publication of CN112270451B publication Critical patent/CN112270451B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/50ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Strategic Management (AREA)
  • Public Health (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Primary Health Care (AREA)
  • Game Theory and Decision Science (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a monitoring and early warning method and system based on reinforcement learning, comprising the following steps: predicting the incidence relation between time-series monitoring data and adverse event labels according to the time-series monitoring data input in real time, and creating a decision environment; modeling the agent's decision actions; the agent selecting a decision action according to the time-series monitoring data input at the current moment; the decision environment outputting response information according to the decision action, the response information comprising the environment state and the reward and punishment value of the decision action; inputting the environment state into a pre-constructed deep reinforcement learning framework, and obtaining the action with the highest expected value among all selectable decision actions of the agent as the output of the agent's next action decision; the agent and the decision environment interacting according to the above steps until the end condition is met, and outputting a prediction result. The invention monitors the condition of the target object in real time through reinforcement learning and improves the timeliness of problem handling.

Description

Monitoring and early warning method and system based on reinforcement learning
Technical Field
The invention relates to the field of intelligent medical treatment, in particular to a monitoring and early warning method and system based on reinforcement learning.
Background
Current research mainly predicts critical adverse events in a supervised learning mode, mainly using logistic regression algorithms, random decision tree algorithms, deep neural network algorithms and the like. Supervised learning is usually one-shot and short-sighted, considering only the immediate outcome; its prediction accuracy depends strongly on the data set, the generalization performance of the model is not strong, and the supervised learning mode, particularly deep neural network algorithms, requires a huge labeled data set. The pre-labeling of critical care data sets is a key step that demands high cost and effort, and the labeling of medical data requires a large amount of time from experienced medical specialists, which makes it costly and expensive.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a monitoring and early warning method and system based on reinforcement learning, and mainly solves the problem that adverse event discovery is delayed due to lack of early diagnosis and early warning in the perioperative period.
In order to achieve the above and other objects, the present invention adopts the following technical solutions.
A monitoring and early warning method based on reinforcement learning comprises the following steps:
predicting the incidence relation between the time sequence monitoring data and the adverse event label according to the time sequence monitoring data input in real time, and creating a decision environment;
modeling the agent decision-making action, wherein the decision-making action comprises waiting for the time-sequence monitoring data input of the next time node or outputting a predicted adverse event label;
the intelligent agent selects a decision action according to the time sequence monitoring data input at the current moment; the decision environment outputs response information according to the decision action, wherein the response information comprises environment states and reward and punishment values of the decision action;
inputting the environment state into a pre-constructed deep reinforcement learning framework, and acquiring the action with the highest expected value in all selectable decision actions of the intelligent agent as the output of the next action decision of the intelligent agent;
and interacting the intelligent agent and the decision-making environment according to the steps until an ending condition is met, and outputting a prediction result.
Optionally, the end condition includes completing prediction of all time-series monitored data within the monitored duration or outputting an adverse event tag.
Optionally, the selecting, by the agent, a decision action according to the time-series monitored data input at the current time includes:
setting a selection strategy of the agent, and selecting a decision action according to the selection strategy, wherein the selection strategy comprises the following steps: randomly or according to a preset probability.
Optionally, the decision environment outputs response information according to the decision action, including:
when the decision action is to wait for the time sequence monitoring data of the next time node to be input, a decision environment acquires the time sequence monitoring data of the next time, predicts the incidence relation between the time sequence monitoring data of the next time and the adverse event label and outputs an environment state corresponding to the time sequence monitoring data of the next time;
when the decision action is used for outputting the predicted adverse event label, the decision environment acquires the time sequence monitoring data at the current moment, predicts the incidence relation between the time sequence monitoring data at the current moment and the adverse event label, outputs a reward and punishment value of the decision action, and judges whether the adverse event label predicted by the agent is correct or not according to the reward and punishment value.
Optionally, the method further comprises constructing a reward-penalty utility function, and the decision environment outputs a reward-penalty value of the decision action according to the reward-penalty utility function.
Optionally, the reward penalty utility function comprises:
R(a_t, M_{:t}, l) [the full reward and punishment utility function is given as a formula image in the original publication]
wherein R(a_t, M_{:t}, l) represents the association of the decision action with the corresponding time-series monitoring data; a_t is the decision action; M_{:t} is the subset of time-series monitoring data up to time node t; p is greater than 0 and is the dimension of the time-series monitoring data; the trade-off parameter (shown as a symbol image in the original) balances advance predictability against accuracy; predict label l is the adverse event expected to be predicted; ∪_{k∈L\l} predict label k is a mispredicted adverse event.
Optionally, comprising:
and constructing an evaluation function, evaluating the decision environment, and adjusting the reward and punishment utility function according to an evaluation result.
Optionally, the evaluation function is represented by:
[The evaluation function is given as a formula image in the original publication.]
wherein, C represents an incidence relation prediction model for predicting the time sequence monitoring data and the adverse event label, D' is a test data set, and l is the adverse event label; # denotes the number of data in the set.
A reinforcement learning-based monitoring and early warning system comprises:
the environment modeling module is used for predicting the incidence relation between the time sequence monitoring data and the adverse event label according to the time sequence monitoring data input in real time and creating a decision environment;
the action modeling module is used for modeling the decision action of the intelligent agent, wherein the decision action comprises waiting for the time sequence monitoring data input of the next time node or outputting a predicted adverse event label;
the environment response module is used for selecting decision-making action by the intelligent agent according to the time sequence monitoring data input at the current moment; the decision environment outputs response information according to the decision action, wherein the response information comprises environment states and reward and punishment values of the decision action;
the reinforcement learning module is used for inputting the environment state into a pre-constructed depth reinforcement learning framework, and acquiring the action with the highest expected value in all selectable decision actions of the intelligent agent as the output of the next action decision of the intelligent agent;
and the interactive prediction module is used for interacting the intelligent agent and the decision-making environment according to the steps until the end condition is met and outputting a prediction result.
As described above, the monitoring and early warning method and system based on reinforcement learning of the present invention have the following advantages.
Early warning is carried out on perioperative target objects through a real-time online early warning method, timeliness of problem finding and problem handling is improved, and safety of the target objects is guaranteed.
Drawings
Fig. 1 is a flowchart of a reinforcement learning-based monitoring and early warning method according to an embodiment of the present invention.
Fig. 2 is a schematic interaction flow diagram of a reinforcement learning-based monitoring and early warning method according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a reinforcement learning process according to an embodiment of the present invention.
Fig. 4 is a block diagram of a monitoring and early warning system based on a reinforcement learning method according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides a reinforcement learning-based monitoring and early warning method, which includes steps S01-S05.
In step S01, according to the time-series monitored data input in real time, the association relationship between the time-series monitored data and the adverse event label is predicted, and a decision environment is created:
in an embodiment, a cardiovascular disease patient can be used as a target object, and different monitoring devices are used to monitor multidimensional cardiovascular indicators of the target object respectively, so as to obtain time-series monitoring data including multiple dimensions. Each monitoring device can acquire a group of time-series monitoring data, and the time-series monitoring data acquired by different time nodes can be used for constructing a complete monitoring data set of the target object in the whole perioperative period. Optionally, the time-series monitoring data of a specific time period in the perioperative period may also be used to construct a complete monitoring data set of the patient, and the specific time setting may be adjusted according to the actual application requirements.
In an embodiment, each monitoring device may be respectively docked with the medical system and output the acquired time-series monitoring data to the medical system; the medical system then uses the multi-dimensional time-series monitoring data of the same target object to construct a multi-dimensional monitoring data set. In particular, assume M ∈ R^{p×T} is the monitoring data set of a patient, where there are p monitored variables, i.e., the time-series monitoring data has p dimensions, and the monitoring duration is T. M is the set of all time-series monitoring data within the monitoring duration T, and M_{:t} denotes the set of all time-series monitoring data before time t, which is a subset of the monitoring data set M, with t < T.
The p-dimensional time-series monitoring data corresponding to time node 1, and the expressions of M and M_{:t}, are given as formula images in the original publication.
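The formula images themselves are not recoverable from this text. A plausible reconstruction consistent with the surrounding definitions — an assumption, writing m_t = (m_t^1, m_t^2, ..., m_t^p)^T for the column of p monitored values at time node t — is:

m_1 = (m_1^1, m_1^2, ..., m_1^p)^T
M = [m_1, m_2, ..., m_T] ∈ R^{p×T}
M_{:t} = [m_1, m_2, ..., m_t], t < T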
in one embodiment, a set of labels L for cardiovascular adverse events may be constructed, each label corresponding to an adverse event, and further, a subset M of monitored data for each time node t may be constructed:tCorrelating with cardiovascular adverse event signatures, resulting in each M:tThe status of (2) is shown. In one embodiment, a prediction model for predicting the state of the output time-series monitored data can be obtained based on the correlation between the time-series monitored data of different time nodes and the adverse cardiovascular event. And modeling the decision environment according to the output of the prediction model, and inputting the time sequence monitoring data into the decision environment to predict the corresponding output response information.
In one embodiment, D represents the cardiovascular critical illness data set to be predicted by the prediction model, and C is a prediction model for determining the correspondence between the state of the time-series monitoring data and the label set. The specific forms of D and C are as follows:
D = {(m_i, l_i), i = 1...n | m_i ∈ M, l_i ∈ L}
C: M → L, with m_i ∈ M and l_i ∈ L
[A further expression is given as a formula image in the original publication.]
In an embodiment, an evaluation function may be further constructed to evaluate the prediction accuracy of the prediction model, and specifically, the following formula may be designed to evaluate the performance of the prediction model C, where D' is a test data set and # represents the number of data in the set.
[The evaluation function is given as a formula image in the original publication.]
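The evaluation formula itself is given only as an image. A hedged reconstruction, assuming the function simply measures the fraction of test samples whose label predicted by C matches the true label (consistent with # being a set-counting operator), is:

Acc(C, D') = #{(m, l) ∈ D' : C(m) = l} / #D'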
In step S02, modeling the agent decision-making action, wherein the decision-making action includes waiting for a time-series monitored data input of a next time node or outputting a predicted adverse event label;
in an embodiment, according to the decision environment constructed in step S01, a decision action of the agent in the decision environment is further constructed.
In one embodiment, the time-series monitoring data at the current moment is first input into the intelligent Agent, which makes the action decision. The Agent selects a decision action according to its own selection strategy. Specifically, the selection strategy may be random selection or selection according to a preset probability (e.g., an epsilon-greedy strategy), and may be set according to the actual application requirements, which is not limited herein.
Specifically, the decision action may be expressed as:
a_t ∈ {wait} ∪ {∪_{k∈L} predict label k} [the action set is given as a formula image in the original publication]
wherein wait represents waiting for the time-series monitoring data of the next time node, and ∪_{k∈L} predict label k represents the adverse event label predicted by the Agent.
According to the state of the time-series monitoring data at the current time node and its own policy π_Θ, the Agent selects a decision action a_t, as follows:
a_t = π_Θ(O_t)
[A further expression is given as a formula image in the original publication.]
where O_t = M_{:t}.
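For illustration only, a minimal sketch of this action space and of a preset-probability (epsilon-greedy) selection strategy is given below in Python. The names (LABELS, ACTIONS, EPSILON, select_action) and the value of the exploration probability are assumptions of this sketch; the four label strings reuse the example adverse events listed later in the embodiment.

```python
import random

# Placeholder adverse-event label set L; the four categories follow the example
# given later in the embodiment (heart failure, cardiac arrest, myocardial
# ischemia, arrhythmia).
LABELS = ["heart_failure", "cardiac_arrest", "myocardial_ischemia", "arrhythmia"]

# Action space: wait for the next time node, or predict one of the labels.
ACTIONS = ["wait"] + ["predict:" + k for k in LABELS]

EPSILON = 0.1  # assumed exploration probability for the epsilon-greedy strategy


def select_action(q_values):
    """Select a decision action a_t.

    q_values: dict mapping each action in ACTIONS to its estimated expected
    value for the current observation O_t = M_:t. With probability EPSILON a
    random action is explored; otherwise the action with the highest expected
    value is chosen.
    """
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(q_values, key=q_values.get)
```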
In step S03, the agent selects a decision-making action according to the time-series monitored data input at the current moment; and the decision environment outputs response information according to the decision action, wherein the response information comprises environment states and reward and punishment values of the decision action.
In an embodiment, a reward and punishment utility function may be constructed, and the decision environment outputs the reward and punishment value of the decision action according to this function. The reward and punishment utility function is an important component of Agent training: it encodes the task of the Agent and directly influences the Agent's behavior. Its design must reflect the optimization target; for early prediction of cardiovascular critical events, the goal is for the Agent to recognize the premonitory signs of a critical event as early as possible while ensuring early-warning accuracy, since a high false-alarm rate generates unnecessary workload for medical staff. The reward and punishment utility function maps the Agent's decision behavior on the monitoring data set to the real number space R, i.e., R: A × D → R, learning the association between the decision environment and the behavior corresponding to the time-series monitoring data observed by the Agent; namely r_t = R(a_t, M_{:t}, l).
In one embodiment, the reward and punishment utility function is designed by balancing the accuracy and the advance predictability of the Agent's prediction of disease deterioration based on the labeled monitoring data set (M, l) ∈ D. The specific reward and punishment utility function is as follows:
[The reward and punishment utility function is given as a formula image in the original publication.]
wherein p is greater than 0, and the trade-off parameter (shown as a symbol image in the original) balances advance predictability against accuracy.
With this reward and punishment function, the Agent obtains a positive reward when it correctly predicts a cardiovascular critical adverse event, is punished for a wrong prediction, and receives a corresponding penalty when the early warning is delayed.
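The reward and punishment formula itself is given only as an image. A hedged sketch consistent with the properties stated here and in the later embodiment — a positive reward for a correct prediction, a penalty for a wrong prediction, and a delay penalty that is time-independent when p = 0 and time-dependent when p > 0 — is written below; the exact functional form and the symbol λ for the trade-off parameter are assumptions:

R(a_t, M_{:t}, l) =
  +1,            if a_t = predict label l (correct prediction)
  -1,            if a_t = predict label k, k ∈ L\l (wrong prediction)
  -λ·(t/T)^p,    if a_t = wait (delayed early warning)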
In one embodiment, when the decision action is waiting for the time sequence monitoring data of the next time node to be input, the decision environment acquires the time sequence monitoring data of the next time, predicts the association relationship between the time sequence monitoring data of the next time and the adverse event label, and outputs the environment state corresponding to the time sequence monitoring data of the next time;
when the decision action is outputting the predicted adverse event label, the decision environment acquires the time sequence monitoring data at the current moment, predicts the incidence relation between the time sequence monitoring data at the current moment and the adverse event label, outputs a reward and punishment value of the decision action, and judges whether the adverse event label predicted by the agent is correct or not according to the reward and punishment value.
In step S04, inputting the environment state into a pre-constructed deep reinforcement learning framework, and obtaining the action with the highest expected value in all selectable decision actions of the agent as the output of the next action decision of the agent;
in one embodiment, the cardiovascular critical adverse event prediction is converted into a Markov decision problem, and an optimal decision action is selected to be output based on the relevance of the environment state and the decision action of the reinforcement learning framework learning.
Referring to fig. 3, in an embodiment, deep reinforcement learning combining a temporal convolutional network with the Q-learning method can be used to model the deterioration of the condition of the cardiovascular critical patient, and the model is trained using a deep Q reinforcement learning framework.
Specifically, a monitoring data set unit 05 is constructed. The time-series monitoring data obtained from the interaction between the Agent and the decision environment at each time node is stored in the monitoring data set unit 05, and during training a certain amount of time-series monitoring data is randomly drawn from the monitoring data set unit 05, so as to address the problems of data correlation and non-stationary distribution.
The current temporal convolutional network 04 is used to evaluate the estimated value function for each possible decision action of the Agent under the environment state output by the decision environment. The estimated value function evaluates the expected reward of a decision action over a preset long-term horizon: a high reward and punishment value at the current moment does not mean that the long-term expected reward is also high, so the effectiveness of a decision action can be evaluated comprehensively through the estimated value function. The manner in which the expected value is calculated is not limited herein.
The target value temporal convolutional network 06 is used to evaluate the true value function corresponding to the time-series monitoring data, and the parameters of the current temporal convolutional network 04 are updated by gradient descent according to the error between the true value function and the estimated value function. The parameters of the current temporal convolutional network 04 are copied to the target value temporal convolutional network 06 every N iterations, and the two networks may use the same network structure.
Finally, the decision action with the highest expected value is obtained as the Agent's optimal decision and is output.
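For illustration only, a compact sketch of the training update described above — a current value network and a target value network sharing a temporal convolutional architecture, a monitoring data set unit that is sampled randomly, and a periodic parameter copy every N iterations — is given below in PyTorch. All names (TCNQNet, ReplayStore, train_step, N_SYNC), the layer sizes, the discount factor, and the use of a smooth L1 loss are assumptions of this sketch, not details taken from the patent.

```python
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class TCNQNet(nn.Module):
    """Temporal convolutional value network: input is the p-dimensional
    monitoring sequence M_:t, output is one expected value per action."""

    def __init__(self, p_dims, n_actions, hidden=32):
        super().__init__()
        self.conv1 = nn.Conv1d(p_dims, hidden, kernel_size=3, padding=2, dilation=2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, x):            # x: (batch, p_dims, t)
        h = F.relu(self.conv1(x))
        h = F.relu(self.conv2(h))
        h = h.mean(dim=-1)           # pool over the time axis
        return self.head(h)          # (batch, n_actions)


class ReplayStore:
    """Monitoring data set unit: stores (O_t, a_t, r_t, O_t+1, done) transitions
    and is sampled randomly to ease data correlation and non-stationarity."""

    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def add(self, transition):
        self.buf.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buf, batch_size)


def train_step(current_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient-descent update of the current network toward the target value.
    batch: pre-collated tensors (obs, actions, rewards, next_obs, done)."""
    obs, actions, rewards, next_obs, done = batch
    q = current_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rewards + gamma * (1.0 - done) * target_net(next_obs).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


N_SYNC = 100  # copy the current-network parameters to the target network every N iterations


def sync_target(current_net, target_net):
    target_net.load_state_dict(current_net.state_dict())
```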
In step S05, the agent interacts with the decision environment according to the above steps until the end condition is satisfied, and a prediction result is output.
Referring to fig. 2, in one embodiment, the prediction Agent02 interacts with the vital signs monitoring data environment (e.g., the vital signs monitoring devices) of the critical patient at each moment to obtain high-dimensional monitored sign data (i.e., the p-dimensional time-series monitoring data).
The prediction model 01 is used to associate the time-series monitoring data with adverse events, obtaining the environment state representation and the reward and punishment value of the agent's decision action.
The environment state is fed back to the Agent02, and the reinforcement learning framework learns the decision actions the agent may take in this state and the expected value corresponding to each decision action; the decision action with the highest expected value is output.
The Agent02 feeds the decision action back to the decision environment 03 to obtain the response information of the decision environment 03, and this interaction process repeats in a loop until the end condition is met, whereupon a prediction result is output.
In one embodiment, the end condition includes completing the prediction of all time series monitored data within the monitored duration or outputting an adverse event tag.
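A minimal sketch of this interaction loop, under the same assumptions as the sketches above (the environment object and its reset()/step() interface stand in for the decision environment 03, and current_net stands in for Agent02's value network; all of these names are hypothetical):

```python
import torch


def run_episode(env, current_net, horizon_T):
    """Interact with the decision environment until the end condition is met:
    either all time-series monitoring data within the monitoring duration T is
    processed, or a predicted adverse event label is output."""
    obs = env.reset()                         # O_1 = M_:1
    for t in range(1, horizon_T + 1):
        with torch.no_grad():
            q_values = current_net(obs.unsqueeze(0)).squeeze(0)
        action = int(q_values.argmax())       # action with the highest expected value
        obs, reward, done = env.step(action)  # decision environment responds
        if done:                              # an adverse event label was output
            return action, reward
    return None, 0.0                          # monitoring duration exhausted without an event
```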
In one embodiment, the early warning message may be initiated when the Agent outputs a predicted adverse event. Optionally, cardiovascular critical adverse events mainly comprise four categories, for example: heart failure, cardiac arrest, myocardial ischemia, and arrhythmia.
Optionally, if a_t = wait, the Agent waits for more data input and continues observing the data sequence in the monitoring data set, where the sequence observed at the next moment is the subset extended by one additional time node: O_{t+1} = M_{:t+1}; if a_t ∈ {∪_{k∈L} predict label k}, the sequence learning ends.
Optionally, when the parameter p is set to 0, the Agent is subject to a time-independent penalty for delayed prediction; when p > 0 the penalty becomes time-dependent, so that the limited subset sequences at the beginning of the monitoring data stream incur a smaller penalty for delayed early warning than receiving more inputs late in the sequence.
Optionally, the behavior of the prediction Agent is evaluated by a penalty function. If the data observed by the prediction Agent cannot identify the adverse cardiovascular event, the monitoring Agent waits to observe more monitored data or directly makes an early warning prompt.
When the Agent gives a prediction and judges that the data of the current time node corresponds to a certain adverse event, the preset early warning information of the medical system is triggered. The early warning information may include a text description, a voice alert, etc. corresponding to the adverse event.
Referring to fig. 4, the embodiment provides a reinforcement learning based monitoring and early warning system for implementing the reinforcement learning based monitoring and early warning method in the foregoing embodiment. Since the technical principle of the system embodiment is similar to that of the method embodiment, repeated description of the same technical details is omitted.
In one embodiment, the reinforcement learning-based monitoring and early warning system includes an environment modeling module 10, an action modeling module 11, an environment response module 12, and a reinforcement learning module 13, where the environment modeling module 10 is configured to assist in performing step S01 described in the foregoing method embodiment; the action modeling module 11 is used to assist in executing step S02 described in the foregoing method embodiments; the environmental response module 12 is used to assist in executing step S03 described in the previous method embodiment; the reinforcement learning module 13 is used to assist in executing step S04 described in the foregoing method embodiment; the environmental response module 12 and the reinforcement learning module 13 are used to assist in executing step S05 described in the foregoing method embodiments.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (9)

1. A monitoring and early warning method based on reinforcement learning is characterized by comprising the following steps:
predicting the incidence relation between the time sequence monitoring data and the adverse event label according to the time sequence monitoring data input in real time, and creating a decision environment;
modeling the agent decision-making action, wherein the decision-making action comprises waiting for the time-sequence monitoring data input of the next time node or outputting a predicted adverse event label;
the intelligent agent selects a decision action according to the time sequence monitoring data input at the current moment; the decision environment outputs response information according to the decision action, wherein the response information comprises environment states and reward and punishment values of the decision action;
inputting the environment state into a pre-constructed deep reinforcement learning framework, and acquiring the action with the highest expected value in all selectable decision actions of the intelligent agent as the output of the next action decision of the intelligent agent;
and interacting the intelligent agent and the decision-making environment according to the steps until an ending condition is met, and outputting a prediction result.
2. The reinforcement learning-based monitoring and early warning method according to claim 1, wherein the end condition comprises completion of prediction of all time-series monitoring data within a monitoring duration or output of an adverse event label.
3. The reinforcement learning-based monitoring and early warning method according to claim 1, wherein the agent selects a decision-making action according to the time-series monitoring data input at the current moment, and the decision-making action comprises:
setting a selection strategy of the agent, and selecting a decision action according to the selection strategy, wherein the selection strategy comprises the following steps: randomly or according to a preset probability.
4. The reinforcement learning-based monitoring and early warning method according to claim 1, wherein the decision environment outputs response information according to the decision action, and comprises:
when the decision action is to wait for the time sequence monitoring data of the next time node to be input, a decision environment acquires the time sequence monitoring data of the next time, predicts the incidence relation between the time sequence monitoring data of the next time and the adverse event label and outputs an environment state corresponding to the time sequence monitoring data of the next time;
when the decision action is used for outputting the predicted adverse event label, the decision environment acquires the time sequence monitoring data at the current moment, predicts the incidence relation between the time sequence monitoring data at the current moment and the adverse event label, outputs a reward and punishment value of the decision action, and judges whether the adverse event label predicted by the agent is correct or not according to the reward and punishment value.
5. The reinforcement learning-based monitoring and early-warning method according to claim 1, comprising constructing a reward and punishment utility function, wherein the decision environment outputs a reward and punishment value of the decision action according to the reward and punishment utility function.
6. The reinforcement learning-based monitoring and early warning method according to claim 5, wherein the reward and punishment utility function comprises:
R(a_t, M_{:t}, l) [the full reward and punishment utility function is given as a formula image in the original publication]
wherein R(a_t, M_{:t}, l) represents the association of the decision action with the corresponding time-series monitoring data; a_t is the decision action; M_{:t} is the subset of time-series monitoring data up to time node t; p is greater than 0 and is the dimension of the time-series monitoring data; the trade-off parameter (shown as a symbol image in the original) balances advance predictability against accuracy; predict label l is the adverse event expected to be predicted; ∪_{k∈L\l} predict label k is a mispredicted adverse event.
7. The reinforcement learning-based monitoring and early warning method according to claim 5, comprising:
and constructing an evaluation function, evaluating the decision environment, and adjusting the reward and punishment utility function according to an evaluation result.
8. The reinforcement learning-based monitoring and early warning method according to claim 7, wherein the evaluation function is represented as:
[The evaluation function is given as a formula image in the original publication.]
wherein C represents an incidence relation prediction model for predicting the time-series monitoring data and the adverse event label, D' is a test data set, and l is the adverse event label; # denotes the number of data in the set.
9. A guardianship early warning system based on reinforcement learning, characterized by comprising:
the environment modeling module is used for predicting the incidence relation between the time sequence monitoring data and the adverse event label according to the time sequence monitoring data input in real time and creating a decision environment;
the action modeling module is used for modeling the decision action of the intelligent agent, wherein the decision action comprises waiting for the time sequence monitoring data input of the next time node or outputting a predicted adverse event label;
the environment response module is used for selecting decision-making action by the intelligent agent according to the time sequence monitoring data input at the current moment; the decision environment outputs response information according to the decision action, wherein the response information comprises environment states and reward and punishment values of the decision action;
the reinforcement learning module is used for inputting the environment state into a pre-constructed depth reinforcement learning framework, and acquiring the action with the highest expected value in all selectable decision actions of the intelligent agent as the output of the next action decision of the intelligent agent; and interacting the intelligent agent and the decision-making environment according to the steps until an ending condition is met, and outputting a prediction result.
CN202011217940.8A 2020-11-04 2020-11-04 Monitoring and early warning method and system based on reinforcement learning Active CN112270451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011217940.8A CN112270451B (en) 2020-11-04 2020-11-04 Monitoring and early warning method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011217940.8A CN112270451B (en) 2020-11-04 2020-11-04 Monitoring and early warning method and system based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112270451A true CN112270451A (en) 2021-01-26
CN112270451B CN112270451B (en) 2022-05-24

Family

ID=74344969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011217940.8A Active CN112270451B (en) 2020-11-04 2020-11-04 Monitoring and early warning method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN112270451B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418242A (en) * 2022-03-28 2022-04-29 海尔数字科技(青岛)有限公司 Material discharging scheme determination method, device, equipment and readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN108694465A (en) * 2018-05-16 2018-10-23 南京邮电大学 Urban SOS Simulation Decision optimization method based on the Q study of SVM vector machines
CN109783709A (en) * 2018-12-21 2019-05-21 昆明理工大学 A kind of sort method based on Markovian decision process and k- arest neighbors intensified learning
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label
EP3543918A1 (en) * 2018-03-20 2019-09-25 Flink AI GmbH Reinforcement learning method
CN110826624A (en) * 2019-11-05 2020-02-21 电子科技大学 Time series classification method based on deep reinforcement learning
CN111578940A (en) * 2020-04-24 2020-08-25 哈尔滨工业大学 Indoor monocular navigation method and system based on cross-sensor transfer learning
US20200337648A1 (en) * 2019-04-24 2020-10-29 GE Precision Healthcare LLC Medical machine time-series event data processor
CN111861752A (en) * 2020-07-24 2020-10-30 中山大学 Trend transaction method and system based on reinforcement learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3543918A1 (en) * 2018-03-20 2019-09-25 Flink AI GmbH Reinforcement learning method
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN108694465A (en) * 2018-05-16 2018-10-23 南京邮电大学 Urban SOS Simulation Decision optimization method based on the Q study of SVM vector machines
CN109783709A (en) * 2018-12-21 2019-05-21 昆明理工大学 A kind of sort method based on Markovian decision process and k- arest neighbors intensified learning
US20200337648A1 (en) * 2019-04-24 2020-10-29 GE Precision Healthcare LLC Medical machine time-series event data processor
CN110263979A (en) * 2019-05-29 2019-09-20 阿里巴巴集团控股有限公司 Method and device based on intensified learning model prediction sample label
CN110826624A (en) * 2019-11-05 2020-02-21 电子科技大学 Time series classification method based on deep reinforcement learning
CN111578940A (en) * 2020-04-24 2020-08-25 哈尔滨工业大学 Indoor monocular navigation method and system based on cross-sensor transfer learning
CN111861752A (en) * 2020-07-24 2020-10-30 中山大学 Trend transaction method and system based on reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QISHENG WANG et al.: "Prioritized Guidance for Efficient Multi-Agent Reinforcement Learning Exploration", Machine Learning *
王寻: "Design and Research of Agent Decision-Making Models in Game Environments Based on Reinforcement Learning", China Excellent Master's Theses Full-text Database, Basic Sciences *
程引: "Design and Application of Time Series Decision Systems Based on Reinforcement Learning", China Doctoral Dissertations Full-text Database, Basic Sciences *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418242A (en) * 2022-03-28 2022-04-29 海尔数字科技(青岛)有限公司 Material discharging scheme determination method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
CN112270451B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
EP3620983B1 (en) Computer-implemented method, computer program product and system for data analysis
KR102216689B1 (en) Method and system for visualizing classification result of deep neural network for prediction of disease prognosis through time series medical data
Xiao et al. Learning time series associated event sequences with recurrent point process networks
US9875142B2 (en) System and method for efficient task scheduling in heterogeneous, distributed compute infrastructures via pervasive diagnosis
Biloš et al. Neural flows: Efficient alternative to neural ODEs
Fanti et al. A three-level strategy for the design and performance evaluation of hospital departments
CN109326353B (en) Method and device for predicting disease endpoint event and electronic equipment
Jabbari et al. Discovery of causal models that contain latent variables through Bayesian scoring of independence constraints
Lefebvre Fault diagnosis and prognosis with partially observed stochastic Petri nets
Liu et al. Multi-task learning via adaptation to similar tasks for mortality prediction of diverse rare diseases
CN112270451B (en) Monitoring and early warning method and system based on reinforcement learning
Prats et al. Automatic generation of workload profiles using unsupervised learning pipelines
Guan et al. Structural dominant failure modes searching method based on deep reinforcement learning
Yu et al. MAG: A novel approach for effective anomaly detection in spacecraft telemetry data
Xiang et al. Reliable post-signal fault diagnosis for correlated high-dimensional data streams
US20220318615A1 (en) Time-aligned reconstruction recurrent neural network for multi-variate time-series
Huegle et al. MPCSL-a modular pipeline for causal structure learning
Berkenstadt et al. Queueing inference for process performance analysis with missing life-cycle data
Lee et al. Clinical event time-series modeling with periodic events
KR102203336B1 (en) Method and apparatus for experimental design optimization and hypothesis generation using generative model
De Oliveira et al. An optimization-based process mining approach for explainable classification of timed event logs
CN114329938A (en) System reliability analysis method and device, computer equipment and storage medium
US20220262524A1 (en) Parameter-estimation of predictor model using parallel processing
KR102182807B1 (en) Apparatus of mixed effect composite recurrent neural network and gaussian process and its operation method
Hanamori et al. Real-time monitoring solution to detect symptoms of system anomalies

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant