WO2022227198A1 - Method and device for determining drug regimen of patient - Google Patents

Method and device for determining drug regimen of patient Download PDF

Info

Publication number
WO2022227198A1
WO2022227198A1 (PCT/CN2021/097139)
Authority
WO
WIPO (PCT)
Prior art keywords
historical
data
state data
patient
unbiased
Prior art date
Application number
PCT/CN2021/097139
Other languages
French (fr)
Chinese (zh)
Inventor
徐卓扬
赵婷婷
孙行智
胡岗
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022227198A1 publication Critical patent/WO2022227198A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/10 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 70/00 - ICT specially adapted for the handling or processing of medical references
    • G16H 70/40 - ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Definitions

  • the present application relates to the field of digital medical technology, and in particular to a method and a device for determining a patient's medication regimen that enable precision medicine.
  • the deep reinforcement learning model needs historical sample data on the grouping and medication of a large number of patients. Since these historical decisions were usually made by doctors, they inevitably carry biases from personal experience and knowledge. When the deep reinforcement learning model estimates the value of different decisions in a specific state from such sample data, this bias skews the value estimates.
  • the purpose of this application is to provide a technical solution that can eliminate individual-specific biases in the process of determining a patient's medication regimen, so as to solve the above-mentioned problems in the prior art and thereby improve the intelligence and accuracy of the process of determining a patient's drug regimen.
  • the application provides a method for determining a patient's medication regimen, comprising the following steps:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application provides a device for determining a medication regimen for a patient, comprising a memory, a processor, and a program for determining the medication regimen for a patient that is stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the following method:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application provides a computer-readable storage medium, wherein the storage medium stores a program for determining a patient's medication regimen, and the following steps are implemented when the program for determining the patient's drug regimen is executed by a processor:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application also provides a device for determining a patient's medication regimen, including:
  • an original state acquisition module which is suitable for acquiring original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient;
  • an unbiased processing module suitable for inputting the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
  • the deep learning module is suitable for inputting the unbiased state data into a deep reinforcement learning model to obtain the corresponding reward value when different medication regimens are taken for the target patient; wherein the reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data.
  • the regimen determination module is adapted to determine the medication regimen of the target patient based on the maximum reward value.
  • the present application introduces bias elimination from the field of causal inference into reinforcement-learning decision-making, optimizing the long-term cumulative return of decision selection while limiting the estimation error caused by selection bias, and improves the accuracy and safety of the model in practical use.
  • by introducing bias elimination and reinforcement learning into the method and device for determining a patient's medication regimen, the selection bias of the medication plan is eliminated and the estimation of the expected effect becomes more accurate, thereby enhancing the match between the medication plan and the patient and significantly improving the treatment effect.
  • Fig. 1 is a flow chart of Embodiment 1 of a method for determining a patient's medication regimen according to the present invention
  • FIG. 2 is a schematic structural diagram of an unbiased model according to Embodiment 1 of the present invention.
  • FIG. 3 is a schematic flowchart of training the unbiased model according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of an application of a deep reinforcement learning model according to Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram of a program module of Embodiment 1 of a device for determining a patient's medication regimen according to the present invention
  • FIG. 7 is a schematic diagram of the hardware structure of Embodiment 1 of the device for determining a medication regimen for a patient according to the present invention.
  • This embodiment proposes a method for determining a patient's medication regimen, and the determining method can be applied to a terminal or a server.
  • the terminals may include smart devices such as smart phones, notebook computers, and tablet computers, and the servers may include PCs, workgroup servers, and enterprise-level servers.
  • the determination method of this embodiment includes the following steps:
  • S100 Acquire original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient.
  • Deep reinforcement learning learns a mapping policy from states to actions: it learns the optimal mapping policy according to the reward value corresponding to each action, selects the optimal action according to that policy, obtains a delayed feedback value based on the state change caused by the optimal action, and iterates this loop until a termination condition is met.
  • the state refers to the original state data of the target patient
  • the action refers to a specific medication plan
  • the reward value refers to the expected feedback effect after taking a specific medication plan based on the state of the target patient.
  • the raw state data may include long-term medical follow-up records of the patient, such as demographic information, examination indicators, medication history, and other data recorded at each follow-up visit. For multiple records, a time-weighted sum over the different visits can be used to obtain an overall record.
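  • As an illustration of the time-weighted aggregation just described, the following sketch shows one way such a combination could be computed. It is not taken from the publication; the function name, feature layout and weights are invented for the example.

```python
import numpy as np

def aggregate_visits(visit_vectors, visit_weights):
    """Combine per-visit feature vectors into a single overall state record.

    visit_vectors: one equally sized 1-D array per follow-up visit
                   (e.g. demographics, lab indicators, medication-history codes).
    visit_weights: one weight per visit, e.g. larger for more recent visits.
    """
    vectors = np.asarray(visit_vectors, dtype=float)   # shape (n_visits, n_features)
    weights = np.asarray(visit_weights, dtype=float)
    weights = weights / weights.sum()                   # normalise so the weights sum to 1
    return weights @ vectors                            # time-weighted sum over the visits

# Three follow-up visits, with more recent visits weighted higher.
overall_record = aggregate_visits(
    visit_vectors=[[0.2, 1.0, 0.0], [0.3, 1.2, 1.0], [0.4, 1.1, 1.0]],
    visit_weights=[1.0, 2.0, 3.0],
)
```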
  • S200 Input the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated.
  • FIG. 2 is a schematic structural diagram of an unbiased model according to Embodiment 1 of the present invention.
  • the unbiased model includes an encoder, a decoder and a predictor: the encoder encodes the input original state data s to output unbiased state data E(s); the decoder decodes the unbiased state data E(s) to obtain analytical state data D(E(s)) corresponding to the original state data; and the predictor predicts, based on the input analytical state data D(E(s)), the reward value R(s, a) corresponding to taking different actions a (i.e., medication regimens).
  • the above encoder, decoder and predictor can all be implemented by a single-layer neural network.
  • the unbiased model provided by this embodiment, on the one hand, combines the encoder with the predictor so that the encoded unbiased state data E(s) is able to predict the reward value R(s, a); on the other hand, it combines the encoder with the decoder so that enough of the original input information is retained, thereby ensuring the accuracy of the prediction result. It can be understood that by constructing a suitable loss function to train the unbiased model, the model's tendency to select a specific action in a specific state can be influenced. The specific composition of the loss function is described in detail below.
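  • A minimal sketch of this encoder, decoder and predictor structure is given below, assuming a PyTorch implementation in which each component is a single linear layer as described above. The class name, dimensions and return convention are illustrative rather than part of the publication.

```python
import torch
import torch.nn as nn

class UnbiasedModel(nn.Module):
    """Encoder, decoder and predictor, each realised as a single-layer network."""

    def __init__(self, state_dim: int, code_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Linear(state_dim, code_dim)     # s -> E(s)
        self.decoder = nn.Linear(code_dim, state_dim)     # E(s) -> D(E(s))
        self.predictor = nn.Linear(state_dim, n_actions)  # D(E(s)) -> reward for each action a

    def forward(self, s: torch.Tensor):
        e = self.encoder(s)    # unbiased state data E(s)
        d = self.decoder(e)    # analytical (reconstructed) state data D(E(s))
        r = self.predictor(d)  # predicted reward R(s, a) for every candidate regimen
        return e, d, r
```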
  • S300 Input the unbiased state data into a deep reinforcement learning model to obtain the corresponding reward value when different medication regimens are adopted for the target patient.
  • the input and output data involved in the deep reinforcement learning model include states, actions and reward values.
  • the deep reinforcement learning model uses a neural network to fit a policy. Given an input state, the network applies the policy and outputs the expected reward value (reward) corresponding to each action; the action with the largest reward value is the best action that the deep reinforcement learning model considers should be chosen.
  • the state refers to the multi-dimensional encoding of the original state data of the target patient
  • the action refers to the multi-dimensional encoding of the medication regimen
  • the expected reward value refers to encoded data describing the feedback effect of taking a specific medication regimen for specific original state data.
  • the input state in this embodiment may be the unbiased state data E(s) output by the encoder of the unbiased model, which may specifically be multi-dimensional vector-encoded data composed of demographic information, examination indicators, and medication history; using the unbiased state data E(s) as the input state data of the deep reinforcement learning model eliminates the specificity in the state data and makes the output of the reinforcement learning model more accurate.
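  • The deep reinforcement learning model can therefore be pictured as a small value network that maps the encoded state E(s) to one expected reward per candidate regimen. The sketch below assumes a PyTorch implementation with an arbitrary hidden width; it is an illustration, not the network described in the publication.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fits the policy: maps an unbiased state encoding to one expected reward per regimen."""

    def __init__(self, code_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q(E(s), a) value per candidate medication regimen
        )

    def forward(self, e_s: torch.Tensor) -> torch.Tensor:
        return self.net(e_s)               # expected reward for each action given E(s)
```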
  • S400 Determine the medication regimen of the target patient based on the maximum reward value.
  • a medication plan with the best expected therapeutic effect can be determined based on the patient's state data, so as to formulate a more appropriate treatment plan for different patients in a more targeted manner, and significantly improve the therapeutic effect.
  • the therapeutic effect of the drug regimen can be determined according to the size of the reward value, for example, the reward value with the largest value generally indicates the best therapeutic effect.
  • assuming that, for a patient's unbiased state data E0, the available drug regimens are A1, A2 and A3, and the deep reinforcement learning model outputs reward values R1, R2 and R3 for them respectively, then if R1 > R2 > R3, R1 is the reward value with the best therapeutic effect, and the medication plan A1 corresponding to R1 is the finally determined medication plan.
  • the unbiased model provided by this solution can remove, to the greatest extent, the bias present in the patient data while retaining the original information of the patient data, thereby ensuring the objectivity of the input to the deep reinforcement learning model and making its patient-classification output more accurate and fair.
  • FIG. 3 shows a schematic flowchart of training the unbiased model according to an embodiment of the present invention. As shown in FIG. 3, training the unbiased model includes the following steps:
  • S310 Acquire first historical sample data of multiple patients, where the first historical sample data includes first historical state data, first historical action data, and first historical reward data.
  • the above-mentioned first historical state data includes the patient's demographic information, examination and test indicators, and medication history; the first historical action data includes the medication plan prescribed by the doctor for the patient; and the first historical reward data includes the patient's health feedback information after taking the medication regimen.
  • S320 Use the first historical state data as the input of the encoder, and use the first historical reward data as the output of the predictor, to train the unbiased model and determine the weight parameters of the encoder, the decoder and the predictor.
  • the loss function Loss1 of the above unbiased model is determined by the following equation:
  • Loss1 = Lce + Linf + Lr;
  • where s represents the current first historical state data; E(s) represents the first historical unbiased state data output after s passes through the encoder; a represents the current first historical action data; A represents the set of all first historical action data; p(a) represents the probability of the current first historical action data among all the first historical action data; p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data; D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder; r represents the current first historical reward data; and R(E(s), a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  • Lce is a KL divergence loss function: by making the conditional probability of each action in the encoded space approach the proportion of that action in the whole sample, the tendency to select an action in the encoded space becomes independent of the input, which removes the action-selection bias.
  • the purpose of Linf is to allow the encoded space to retain enough original state information; the purpose of Lr is to enable the encoded space to have the ability to predict reward, that is, to add reward information to the encoded space.
  • using these three loss terms, the encoded unbiased state data E(s) removes the tendency to select actions in a specific state, while retaining sufficient original input information and reward prediction ability. In this way, using the unbiased state data E(s) as the input to the deep reinforcement learning model yields a more unbiased expected reward value.
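  • A hedged sketch of Loss1 is shown below. Only the composition Loss1 = Lce + Linf + Lr and the KL form of Lce are stated in the text; the publication renders the Linf and Lr formulas as images and does not say how p(a|E(s)) is obtained, so this sketch assumes an L2 reconstruction loss for Linf, an L2 reward-prediction loss for Lr, and an auxiliary softmax propensity head for p(a|E(s)).

```python
import torch
import torch.nn.functional as F

def unbiased_model_loss(s, a, r, model, propensity_head, p_a_marginal):
    """Sketch of Loss1 = Lce + Linf + Lr for one batch of first historical sample data.

    s: (batch, state_dim) first historical state data
    a: (batch,) index of the regimen actually prescribed (first historical action data)
    r: (batch,) observed reward (first historical reward data)
    p_a_marginal: (n_actions,) proportion of each action over the whole sample
    """
    e, d, r_pred_all = model(s)                                # E(s), D(E(s)), predicted rewards
    p_a_marginal = torch.as_tensor(p_a_marginal, dtype=torch.float32)

    # Lce: KL(p(a) || p(a|E(s))) pushes the action propensity in the encoded space
    # toward the marginal action distribution, i.e. makes it independent of the input.
    p_a_given_e = torch.softmax(propensity_head(e), dim=-1)    # assumed auxiliary head
    ratio = p_a_marginal.clamp_min(1e-8) / p_a_given_e.clamp_min(1e-8)
    lce = (p_a_marginal * torch.log(ratio)).sum(-1).mean()

    # Linf (assumed L2 form): keep enough of the original state information in the code.
    linf = F.mse_loss(d, s)

    # Lr (assumed L2 form): make the encoded space predictive of the observed reward.
    r_pred = r_pred_all.gather(1, a.unsqueeze(1)).squeeze(1)   # R(E(s), a) for the taken action
    lr = F.mse_loss(r_pred, r)

    return lce + linf + lr
```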
  • FIG. 4 shows a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention.
  • As shown in FIG. 4, training the deep reinforcement learning model includes the following steps:
  • S410 Acquire second historical sample data of multiple patients, where the second historical sample data includes second historical state data, second historical action data, and second historical reward data.
  • the second historical state data includes the patient's demographic information, examination and test indicators, and medication history;
  • the second historical action data includes the medication plan prescribed by the doctor for the patient; and
  • the second historical reward data includes the patient's health feedback information after taking the medication regimen.
  • the second historical reward data may include short-term reward data and long-term reward data, wherein the weight of the long-term reward value is higher than the weight of the short-term reward value.
  • the short-term reward data and long-term reward data here are determined according to the follow-up time. For example, it is stipulated that feedback information within one year belongs to short-term reward data, and feedback information of more than one year belongs to long-term reward data.
  • for patients, the long-term effect after treatment is clearly more important than the short-term effect, so this embodiment sets a higher weight for the long-term reward data; for example, the weight of the short-term reward data is set to 1 and the weight of the long-term reward data is set to 5, so that the second historical reward data better reflects the long-term effect.
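  • The text gives the example weights but not an explicit combination rule; a weighted sum is one natural reading, sketched below with the 1 and 5 from the example.

```python
def combined_reward(short_term_reward: float, long_term_reward: float,
                    w_short: float = 1.0, w_long: float = 5.0) -> float:
    """Second historical reward with long-term feedback weighted higher than short-term."""
    return w_short * short_term_reward + w_long * long_term_reward
```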
  • S420 Use the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data, the output second historical reward data is maximized.
  • the loss function Loss2 of the above-mentioned deep reinforcement learning model is determined by the following formula:
  • Loss2 = (Q(s_t, a_t) - (r_t + max_a(γ · Q(s_{t+1}, a))))^2;
  • where s_t represents the second historical state data at time t; a_t represents the second historical action data at time t; r_t represents the second historical reward data corresponding to taking the second historical action data a_t in the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when the second historical action data a is taken for the second historical state data at time t+1; and γ is a constant.
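  • The following sketch computes Loss2 over a batch in the usual Q-learning way. It assumes a PyTorch Q-network such as the one sketched earlier; the value of γ is illustrative, since the publication only says it is a constant.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s_t, a_t, r_t, s_t1, gamma: float = 0.9):
    """Loss2 = (Q(s_t, a_t) - (r_t + max_a gamma*Q(s_{t+1}, a)))^2, averaged over the batch."""
    q_taken = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    with torch.no_grad():                                         # the target is held fixed
        target = r_t + (gamma * q_net(s_t1)).max(dim=1).values    # r_t + max_a gamma*Q(s_{t+1}, a)
    return F.mse_loss(q_taken, target)
```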
  • FIG. 5 is a schematic diagram of an application of a deep reinforcement learning model according to Embodiment 1 of the present invention.
  • the deep learning model is connected to the encoder of the unbiased model; the unbiased state data E(s) output by the encoder serves as the input of the deep learning model, which finally outputs, for the same state s, the reward values corresponding to taking different actions a.
  • Q(s, a0), Q(s, a1) ... Q(s, an) in FIG. 5 respectively represent the reward values obtained by taking the different actions.
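  • Putting the pieces together, inference as pictured in FIG. 5 could look like the sketch below, reusing the hypothetical UnbiasedModel and QNetwork classes from the earlier examples; the regimen names are placeholders.

```python
import torch

def recommend_regimen(raw_state, unbiased_model, q_net, regimens):
    """Encode the raw state, score every candidate regimen, and pick the maximum reward.

    regimens: list of regimen identifiers, e.g. ["A1", "A2", "A3"].
    """
    with torch.no_grad():
        e_s, _, _ = unbiased_model(raw_state.unsqueeze(0))  # E(s): bias-removed state encoding
        q_values = q_net(e_s).squeeze(0)                    # Q(s, a0) ... Q(s, an)
    best = int(torch.argmax(q_values))                      # index of the maximum reward value
    return regimens[best], q_values
```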
  • the device 60 for determining a drug regimen for a patient may include, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present invention and implement the above-mentioned method for determining a patient's medication regimen.
  • the program module referred to in the present invention refers to a series of computer program instruction segments capable of accomplishing specific functions, and is more suitable for describing the execution process of the device 60 for determining a patient's medication regimen in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
  • the original state acquisition module 61 is suitable for acquiring the original state data of the target patient, and the original state data is used to characterize the disease characteristics of the patient;
  • an unbiased processing module 62 adapted to input the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
  • the deep learning module 63 is suitable for inputting the unbiased state data into a deep reinforcement learning model to obtain reward values corresponding to different medication regimens for the target patient; wherein a reward value is the expected feedback effect after taking the medication regimen, based on the unbiased state data;
  • the regimen determination module 64 is adapted to determine the medication regimen of the target patient based on the maximum reward value.
  • the device for determining a patient's medication plan provided in this embodiment eliminates the deviation of action selection through the unbiased processing module, so that the estimation of the expected reward is more accurate, thereby ensuring that the deep learning module fits a more reasonable expected reward value and thus improving the patient's therapeutic effect.
  • This embodiment also provides a computer device that can execute a program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers), etc.
  • the computer device 70 in this embodiment at least includes but is not limited to: a memory 71 and a processor 72 that can be communicatively connected to each other through a system bus, as shown in FIG. 7 . It should be noted that FIG. 7 only shows a computer device 70 having components 71-72, but it should be understood that implementation of all of the illustrated components is not required, and more or fewer components may be implemented instead.
  • the memory 71 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, etc.
  • the memory 71 may be an internal storage unit of the computer device 70 , such as a hard disk or memory of the computer device 70 .
  • the memory 71 may also be an external storage device of the computer device 70, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 71 may also include both the internal storage unit of the computer device 70 and its external storage device.
  • the memory 71 is generally used to store the operating system and various application software installed on the computer device 70 , such as the program code of the device 60 for determining a patient's medication regimen in the first embodiment.
  • the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 72 is typically used to control the overall operation of the computer device 70 .
  • the processor 72 is configured to run program codes or process data stored in the memory 71 , for example, run the device 60 for determining a patient's medication regimen to implement the method for determining a patient's medication regimen in the first embodiment.
  • This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an application store, etc., on which a computer program is stored; when the program is executed by a processor, the corresponding function is realized.
  • the computer-readable storage medium of this embodiment is used to store the device 60 for determining a patient's medication regimen, and when executed by a processor, implements the method for determining a patient's medication plan of the first embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A method and device for determining a drug regimen of a patient, comprising the following steps: obtaining original state data of a target patient, the original state data being used for representing a condition feature of the patient (S100); inputting the original state data into an unbiased model to obtain unbiased state data in which the state distribution deviation is eliminated (S200); inputting the unbiased state data into a deep reinforcement learning model to obtain reward values when different drug regimens are used for the target patient (S300); and determining a drug regimen of the target patient on the basis of a maximum reward value (S400). By introducing deviation elimination and reinforcement learning into the method and device for determining a drug regimen of a patient, the selection deviation of a drug regimen is eliminated, such that the estimation of an expected reward is more accurate, thereby significantly enhancing the matching degree between a drug regimen and a patient.

Description

Method and device for determining a medication regimen for a patient

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application No. 202110474846.9, titled "Method and device for determining a medication regimen for a patient" and filed on April 29, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD

The present application relates to the field of digital medical technology, and in particular to a method and a device for determining a patient's medication regimen that enable precision medicine.

BACKGROUND

Due to the specificity of each patient's physical condition, doctors often prescribe different medications for patients with the same type of disease in order to achieve the best therapeutic effect. Conventional practice divides patients into different groups according to certain strategies, so as to provide targeted medication regimens based on group characteristics. The inventors realized that the accuracy of this grouping directly affects the treatment effect. Because deep reinforcement learning methods can optimize long-term outcomes, they can be used to solve more and more sequential decision-making problems in real-world scenarios, and existing technologies already perform patient grouping through deep reinforcement learning.

A deep reinforcement learning model needs historical sample data on the grouping and medication of a large number of patients. Since these historical decisions were usually made by doctors, they inevitably carry biases from personal experience and knowledge. When the deep reinforcement learning model estimates the value of different decisions in a specific state from such sample data, this bias skews the value estimates.
SUMMARY OF THE INVENTION

The purpose of this application is to provide a technical solution that can eliminate individual-specific biases in the process of determining a patient's medication regimen, so as to solve the above-mentioned problems in the prior art and thereby improve the intelligence and accuracy of the process of determining a patient's medication regimen.

To achieve the above object, the present application provides a method for determining a patient's medication regimen, comprising the following steps:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application provides a device for determining a patient's medication regimen, comprising a memory, a processor, and a program for determining a patient's medication regimen that is stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the following method:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application provides a computer-readable storage medium, wherein the storage medium stores a program for determining a patient's medication regimen, and the following steps are implemented when the program is executed by a processor:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application also provides a device for determining a patient's medication regimen, comprising:

an original state acquisition module, adapted to acquire original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

an unbiased processing module, adapted to input the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

a deep learning module, adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

a regimen determination module, adapted to determine the medication regimen of the target patient based on the maximum reward value.

The present application introduces bias elimination from the field of causal inference into reinforcement-learning decision-making, optimizing the long-term cumulative return of decision selection while limiting the estimation error caused by selection bias, which improves the accuracy and safety of the model in practical use. By introducing bias elimination and reinforcement learning into the method and device for determining a patient's medication regimen, the selection bias of the medication plan is eliminated and the estimation of the expected effect becomes more accurate, thereby enhancing the match between the medication plan and the patient and significantly improving the treatment effect.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of Embodiment 1 of the method for determining a patient's medication regimen according to the present invention;

FIG. 2 is a schematic structural diagram of the unbiased model according to Embodiment 1 of the present invention;

FIG. 3 is a schematic flowchart of training the unbiased model according to Embodiment 1 of the present invention;

FIG. 4 is a schematic flowchart of training the deep reinforcement learning model according to Embodiment 1 of the present invention;

FIG. 5 is a schematic diagram of an application of the deep reinforcement learning model according to Embodiment 1 of the present invention;

FIG. 6 is a schematic diagram of the program modules of Embodiment 1 of the device for determining a patient's medication regimen according to the present invention;

FIG. 7 is a schematic diagram of the hardware structure of Embodiment 1 of the device for determining a patient's medication regimen according to the present invention.
DETAILED DESCRIPTION

It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

Embodiment 1

This embodiment proposes a method for determining a patient's medication regimen, which can be applied in a terminal or a server. The terminal may include smart devices such as smart phones, notebook computers and tablet computers, and the server may include PCs, workgroup servers and enterprise-level servers. Referring to FIG. 1, the determination method of this embodiment includes the following steps.

S100: Acquire original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient.

Deep reinforcement learning learns a mapping policy from states to actions: it learns the optimal mapping policy according to the reward value corresponding to each action, selects the optimal action according to that policy, obtains a delayed feedback value based on the state change caused by the optimal action, and iterates this loop until a termination condition is met. In the embodiment of the present invention, the state refers to the original state data of the target patient, the action refers to a specific medication plan, and the reward value refers to the expected feedback effect after taking a specific medication plan based on the state of the target patient. The original state data may include long-term medical follow-up records of the patient, such as demographic information, examination indicators, medication history and other data recorded at each follow-up visit. For multiple records, a time-weighted sum over the different visits can be used to obtain an overall record.
S200: Input the original state data into the unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated.

FIG. 2 is a schematic structural diagram of the unbiased model according to Embodiment 1 of the present invention. As shown in FIG. 2, the unbiased model includes an encoder, a decoder and a predictor: the encoder encodes the input original state data s to output unbiased state data E(s); the decoder decodes the unbiased state data E(s) to obtain analytical state data D(E(s)) corresponding to the original state data; and the predictor predicts, based on the input analytical state data D(E(s)), the reward value R(s, a) corresponding to taking different actions a (i.e., medication regimens). The encoder, decoder and predictor can all be implemented as single-layer neural networks.

On the one hand, by combining the encoder with the predictor, the unbiased model provided by this embodiment enables the encoded unbiased state data E(s) to predict the reward value R(s, a); on the other hand, by combining the encoder with the decoder, the unbiased model retains enough of the original input information, thereby ensuring the accuracy of the prediction result. It can be understood that by constructing a suitable loss function to train the unbiased model, the model's tendency to select a specific action in a specific state can be influenced. The specific composition of the loss function is described in detail below.
S300: Input the unbiased state data into the deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient.

Those skilled in the art understand that the input and output data of the deep reinforcement learning model include states, actions and reward values. The deep reinforcement learning model uses a neural network to fit a policy: given an input state, the network applies the policy and outputs the expected reward value corresponding to each action, and the action with the largest reward value is the best action that the model considers should be chosen. In this embodiment, the state refers to the multi-dimensional encoding of the original state data of the target patient, the action refers to the multi-dimensional encoding of the medication regimen, and the expected reward value refers to encoded data describing the feedback effect of taking a specific medication regimen for specific original state data. It should be noted that the input state in this embodiment may be the unbiased state data E(s) output by the encoder of the unbiased model, which may specifically be multi-dimensional vector-encoded data composed of demographic information, examination indicators and medication history; using the unbiased state data E(s) as the input state data of the deep reinforcement learning model eliminates the specificity in the state data and makes the output of the reinforcement learning model more accurate.

S400: Determine the medication regimen of the target patient based on the maximum reward value.

This step can determine the medication plan with the best expected therapeutic effect based on the patient's state data, so as to formulate more appropriate treatment plans for different patients in a more targeted manner and significantly improve the therapeutic effect. In this embodiment, the therapeutic effect of a medication regimen can be judged by the size of the reward value; for example, the largest reward value generally indicates the best therapeutic effect. Assuming that, for a patient's unbiased state data E0, the available medication regimens are A1, A2 and A3 and the deep reinforcement learning model outputs reward values R1, R2 and R3 for them respectively, then if R1 > R2 > R3, R1 is the reward value with the best therapeutic effect, and the medication plan A1 corresponding to R1 is the finally determined medication plan.

Through the above steps, the unbiased model provided by this solution can remove, to the greatest extent, the bias present in the patient data while retaining the original information of the patient data, thereby ensuring the objectivity of the input to the deep reinforcement learning model and making its patient-classification output more accurate and fair.
FIG. 3 shows a schematic flowchart of training the unbiased model according to an embodiment of the present invention. As shown in FIG. 3, training the unbiased model includes the following steps.

S310: Acquire first historical sample data of multiple patients, the first historical sample data including first historical state data, first historical action data and first historical reward data.

The first historical state data includes the patient's demographic information, examination and test indicators, and medication history; the first historical action data includes the medication plan prescribed by the doctor for the patient; and the first historical reward data includes the patient's health feedback information after taking the medication regimen.

S320: Use the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to train the unbiased model, so as to determine the weight parameters of the encoder, the decoder and the predictor.

S330: When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.

In one example, the loss function Loss1 of the above unbiased model is determined by the following equations:
Loss1 = Lce + Linf + Lr;

Lce = ∑_{a∈A} p(a) · log[p(a) / p(a|E(s))];

(The Linf and Lr terms are defined by the formulas rendered as images PCTCN2021097139-appb-000001 and PCTCN2021097139-appb-000002 in the original publication.)

where s represents the current first historical state data; E(s) represents the first historical unbiased state data output after s passes through the encoder; a represents the current first historical action data; A represents the set of all first historical action data; p(a) represents the probability of the current first historical action data among all the first historical action data; p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data; D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder; the symbol rendered as image PCTCN2021097139-appb-000003 denotes the L2 regularization of its argument x; r represents the current first historical reward data; and R(E(s), a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
FIG. 4 shows a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention. As shown in FIG. 4, training the deep reinforcement learning model includes the following steps.

S410: Acquire second historical sample data of multiple patients, the second historical sample data including second historical state data, second historical action data and second historical reward data.

The second historical state data includes the patient's demographic information, examination and test indicators, and medication history; the second historical action data includes the medication plan prescribed by the doctor for the patient; and the second historical reward data includes the patient's health feedback information after taking the medication regimen. Specifically, the second historical reward data may include short-term reward data and long-term reward data, where the weight of the long-term reward value is higher than the weight of the short-term reward value. Short-term and long-term reward data are distinguished according to the follow-up time; for example, feedback information within one year is taken as short-term reward data and feedback information beyond one year as long-term reward data. For patients, the long-term effect after treatment is clearly more important than the short-term effect, so this embodiment sets a higher weight for the long-term reward data; for example, the weight of the short-term reward data is set to 1 and the weight of the long-term reward data is set to 5, so that the second historical reward data better reflects the long-term effect.

S420: Use the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data, the output second historical reward data is maximized.

S430: When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.

In one example, the loss function Loss2 of the above deep reinforcement learning model is determined by the following formula:
Loss2 = (Q(s_t, a_t) - (r_t + max_a(γ · Q(s_{t+1}, a))))^2;

where s_t represents the second historical state data at time t; a_t represents the second historical action data at time t; r_t represents the second historical reward data corresponding to taking the second historical action data a_t in the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when the second historical action data a is taken for the second historical state data at time t+1; and γ is a constant.
FIG. 5 is a schematic diagram of an application of the deep reinforcement learning model according to Embodiment 1 of the present invention. As shown in FIG. 5, the deep learning model is connected to the encoder of the unbiased model; the unbiased state data E(s) output by the encoder serves as the input of the deep learning model, which finally outputs, for the same state s, the reward values corresponding to taking different actions a. Q(s, a0), Q(s, a1) ... Q(s, an) in FIG. 5 respectively represent the reward values obtained by taking the different actions.
请继续参阅图6,示出了一种患者用药方案的确定装置,在本实施例中,患者用药方案的确定装置60可以包括或被分割成一个或多个程序模块,一个或者多个程序模块被存储于存储介质中,并由一个或多个处理器所执行,以完成本发明,并可实现上述患者用药方案的确定方法。本发明所称的程序模块是指能够完成特定功能的一系列计算机程序指令 段,比程序本身更适合于描述患者用药方案的确定装置60在存储介质中的执行过程。以下描述将具体介绍本实施例各程序模块的功能:Please continue to refer to FIG. 6 , which shows a device for determining a medication regimen of a patient. In this embodiment, the device 60 for determining a drug regimen for a patient may include or be divided into one or more program modules, one or more program modules is stored in a storage medium and executed by one or more processors to complete the present invention and implement the above-mentioned method for determining a patient's medication regimen. The program module referred to in the present invention refers to a series of computer program instruction segments capable of accomplishing specific functions, and is more suitable for describing the execution process of the device 60 for determining a patient's medication regimen in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
原始状态获取模块61,适用于获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;The original state acquisition module 61 is suitable for acquiring the original state data of the target patient, and the original state data is used to characterize the disease characteristics of the patient;
无偏处理模块62,适用于将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;an unbiased processing module 62, adapted to input the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
深度学习模块63，适用于将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；The deep learning module 63 is adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
方案确定模块64,适用于基于最大的奖励值确定所述目标患者的用药方案。The regimen determination module 64 is adapted to determine the medication regimen of the target patient based on the maximum reward value.
本实施例提供的患者用药方案的确定装置，通过无偏处理模块消除了动作选择的偏差，使预期奖励的估计更加准确，从而确保深度学习模块拟合得到更加合理的预期奖励值，从而改善患者的治疗效果。The device for determining a patient's medication regimen provided in this embodiment eliminates the bias in action selection through the unbiased processing module, making the estimation of the expected reward more accurate, thereby ensuring that the deep learning module fits a more reasonable expected reward value and thus improving the patient's therapeutic effect.
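A schematic composition of modules 61-64, assuming the hypothetical encoder and Q-network sketches given earlier (illustrative only; the class and method names are not from the original disclosure):

```python
class MedicationRegimenDevice:
    """Illustrative wrapper mirroring modules 61-64 of the determination device 60."""
    def __init__(self, encoder, q_net, regimens):
        self.encoder = encoder      # used by the unbiased processing module 62
        self.q_net = q_net          # used by the deep learning module 63
        self.regimens = regimens    # list of candidate medication regimens

    def determine(self, raw_state):
        # module 61: raw_state is assumed to have been acquired for the target patient
        unbiased = self.encoder(raw_state)        # module 62: remove the state distribution bias
        rewards = self.q_net(unbiased)            # module 63: reward value for each regimen
        best = int(rewards.argmax())              # module 64: pick the maximum reward value
        return self.regimens[best]
```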
本实施例还提供一种计算机设备，如可以执行程序的智能手机、平板电脑、笔记本电脑、台式计算机、机架式服务器、刀片式服务器、塔式服务器或机柜式服务器（包括独立的服务器，或者多个服务器所组成的服务器集群）等。本实施例的计算机设备70至少包括但不限于：可通过系统总线相互通信连接的存储器71、处理器72，如图7所示。需要指出的是，图7仅示出了具有组件71-72的计算机设备70，但是应理解的是，并不要求实施所有示出的组件，可以替代的实施更多或者更少的组件。This embodiment also provides a computer device capable of executing a program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server, or a server cluster composed of multiple servers), etc. The computer device 70 in this embodiment at least includes, but is not limited to, a memory 71 and a processor 72 that can be communicatively connected to each other through a system bus, as shown in FIG. 7. It should be noted that FIG. 7 only shows the computer device 70 with components 71-72, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
本实施例中,存储器71(即可读存储介质)包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器71可以是计算机设备70的内部存储单元,例如该计算机设备70的硬盘或内存。在另一些实施例中,存储器71也可以是计算机设备70的外部存储设备,例如该计算机设备70上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,存储器71还可以既包括计算机设备70的内部存储单元也包括其外部存储设备。本实施例中,存储器71通常用于存储安装于计算机设备70的操作系统和各类应用软件,例如实施例一的患者用药方案的确定装置60的程序代码等。此外,存储器71还可以用于暂时地存储已经输出或者将要输出的各类数据。In this embodiment, the memory 71 (ie, a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (eg, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc. In some embodiments, the memory 71 may be an internal storage unit of the computer device 70 , such as a hard disk or memory of the computer device 70 . In other embodiments, the memory 71 may also be an external storage device of the computer device 70, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Of course, the memory 71 may also include both the internal storage unit of the computer device 70 and its external storage device. In this embodiment, the memory 71 is generally used to store the operating system and various application software installed on the computer device 70 , such as the program code of the device 60 for determining a patient's medication regimen in the first embodiment. In addition, the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
处理器72在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器72通常用于控制计算机设备70的总体操作。本实施例中,处理器72用于运行存储器71中存储的程序代码或者处理数据,例如运行患者用药方案的确定装置60,以实现实施例一的患者用药方案的确定方法。The processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 72 is typically used to control the overall operation of the computer device 70 . In this embodiment, the processor 72 is configured to run program codes or process data stored in the memory 71 , for example, run the device 60 for determining a patient's medication regimen to implement the method for determining a patient's medication regimen in the first embodiment.
本实施例还提供一种计算机可读存储介质，如闪存、硬盘、多媒体卡、卡型存储器（例如，SD或DX存储器等）、随机访问存储器（RAM）、静态随机访问存储器（SRAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、可编程只读存储器（PROM）、磁性存储器、磁盘、光盘、服务器、App应用商城等等，其上存储有计算机程序，程序被处理器执行时实现相应功能。本实施例的计算机可读存储介质用于存储患者用药方案的确定装置60，被处理器执行时实现实施例一的患者用药方案的确定方法。This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, and so on, on which a computer program is stored; when the program is executed by a processor, the corresponding function is implemented. The computer-readable storage medium of this embodiment is used to store the device 60 for determining a patient's medication regimen, and when executed by a processor, implements the method for determining a patient's medication regimen of the first embodiment.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。It should be noted that, herein, the terms "include", "comprise", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article or system that includes the element.
上述本申请实施例序号仅仅为了描述，不代表实施例的优劣。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。词语第一、第二、以及第三等的使用不表示任何顺序，可将这些词语解释为标识。The above serial numbers of the embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be construed as identifiers.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如只读存储器镜像（Read Only Memory image，ROM）/随机存取存储器（Random Access Memory，RAM）、磁碟、光盘）中，包括若干指令用以使得一台终端设备（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a read-only memory image (ROM)/random access memory (RAM), a magnetic disk, or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. 一种患者用药方案的确定方法,其中,包括以下步骤:A method for determining a patient's medication regimen, comprising the following steps:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  2. 根据权利要求1所述的患者用药方案的确定方法，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The method for determining a patient's medication regimen according to claim 1, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  3. 根据权利要求2所述的患者用药方案的确定方法,其中,所述无偏模型的训练过程包括以下步骤:The method for determining a patient's medication regimen according to claim 2, wherein the training process of the unbiased model comprises the following steps:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  4. 根据权利要求3所述的患者用药方案的确定方法,其中,所述无偏模型的损失函数Loss1由以下算式确定:The method for determining a patient's medication regimen according to claim 3, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100001 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100002 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100003]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100003 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  5. 根据权利要求1所述的患者用药方案的确定方法,其中,所述深度强化学习模型的训练过程,包括以下步骤:The method for determining a patient's medication regimen according to claim 1, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  6. 根据权利要求5所述的患者用药方案的确定方法,其中,所述深度强化学习模型的损失函数Loss2由以下算式确定:The method for determining a patient's medication regimen according to claim 5, wherein the loss function Loss2 of the deep reinforcement learning model is determined by the following formula:
    Loss2 = (Q(s_t, a_t) - (r_t + max(γ × Q(s_{t+1}, a))))²;
    上式中，s_t表示在t时刻的第二历史状态数据，a_t表示在t时刻的第二历史动作数据，r_t表示对第二历史状态数据s_t采取第二历史动作数据a_t对应的第二历史奖励数据；Q(s_{t+1},a)表示对t+1时刻的第二历史状态数据采取第二历史动作数据a时得到的第二历史奖励数据，γ为常数。In the above formula, s_t represents the second historical state data at time t, a_t represents the second historical action data at time t, and r_t represents the second historical reward data corresponding to taking the second historical action data a_t for the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when second historical action data a is taken for the second historical state data at time t+1, and γ is a constant.
  7. 根据权利要求5所述的患者用药方案的确定方法,其中,所述第二历史奖励数据包括短期奖励数据和长期奖励数据,所述长期奖励数据的权重高于所述短期奖励数据的权重。The method for determining a patient's medication regimen according to claim 5, wherein the second historical reward data includes short-term reward data and long-term reward data, and the weight of the long-term reward data is higher than the weight of the short-term reward data.
  8. 一种患者用药方案的确定设备，包括存储器、处理器以及存储在存储器上并可在处理器上运行的患者用药方案的确定程序，其中，所述处理器执行所述患者用药方案的确定程序时实现以下方法的步骤：A device for determining a patient's medication regimen, comprising a memory, a processor, and a program for determining a patient's medication regimen that is stored on the memory and executable on the processor, wherein the processor, when executing the program for determining a patient's medication regimen, implements the steps of the following method:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  9. 根据权利要求8所述的患者用药方案的确定设备，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The device for determining a patient's medication regimen according to claim 8, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  10. 根据权利要求9所述的患者用药方案的确定设备,其中,所述无偏模型的训练过程包括以下步骤:The device for determining a patient's medication regimen according to claim 9, wherein the training process of the unbiased model comprises the following steps:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  11. 根据权利要求10所述的患者用药方案的确定设备,其中,所述无偏模型的损失函数Loss1由以下算式确定:The device for determining a patient's medication regimen according to claim 10, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100004 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100005 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100006]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100006 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  12. 根据权利要求8所述的患者用药方案的确定设备,其中,所述深度强化学习模型的训练过程,包括以下步骤:The device for determining a patient's medication regimen according to claim 8, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  13. 根据权利要求12所述的患者用药方案的确定设备,其中,所述深度强化学习模型的损失函数Loss2由以下算式确定:The device for determining a patient's medication regimen according to claim 12, wherein the loss function Loss2 of the deep reinforcement learning model is determined by the following formula:
    Loss2 = (Q(s_t, a_t) - (r_t + max(γ × Q(s_{t+1}, a))))²;
    上式中，s_t表示在t时刻的第二历史状态数据，a_t表示在t时刻的第二历史动作数据，r_t表示对第二历史状态数据s_t采取第二历史动作数据a_t对应的第二历史奖励数据；Q(s_{t+1},a)表示对t+1时刻的第二历史状态数据采取第二历史动作数据a时得到的第二历史奖励数据，γ为常数。In the above formula, s_t represents the second historical state data at time t, a_t represents the second historical action data at time t, and r_t represents the second historical reward data corresponding to taking the second historical action data a_t for the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when second historical action data a is taken for the second historical state data at time t+1, and γ is a constant.
  14. 根据权利要求12所述的患者用药方案的确定设备,其中,所述第二历史奖励数据包括短期奖励数据和长期奖励数据,所述长期奖励数据的权重高于所述短期奖励数据的权重。The device for determining a patient's medication regimen according to claim 12, wherein the second historical reward data includes short-term reward data and long-term reward data, and the weight of the long-term reward data is higher than the weight of the short-term reward data.
  15. 一种计算机可读存储介质,其中,所述存储介质上存储有患者用药方案的确定程序,所述患者用药方案的确定程序被处理器执行时实现以下步骤:A computer-readable storage medium, wherein a program for determining a patient's medication regimen is stored on the storage medium, and when the program for determining a patient's medication regimen is executed by a processor, the following steps are implemented:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  16. 根据权利要求15所述的存储介质，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The storage medium according to claim 15, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  17. 根据权利要求16所述的存储介质,其中,所述无偏模型的训练过程包括以下步骤:The storage medium of claim 16, wherein the training process of the unbiased model comprises the steps of:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  18. 根据权利要求17所述的存储介质,其中,所述无偏模型的损失函数Loss1由以下算式确定:The storage medium according to claim 17, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100007 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100008 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100009]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100009 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  19. 根据权利要求15所述的存储介质,其中,所述深度强化学习模型的训练过程,包括以下步骤:The storage medium according to claim 15, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  20. 一种患者用药方案的确定装置,其中,包括:A device for determining a patient's medication regimen, comprising:
    原始状态获取模块,适用于获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;an original state acquisition module, which is suitable for acquiring original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient;
    无偏处理模块,适用于将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;an unbiased processing module, suitable for inputting the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
    深度学习模块，适用于将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；a deep learning module, adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    方案确定模块,适用于基于最大的奖励值确定所述目标患者的用药方案。The regimen determination module is adapted to determine the medication regimen of the target patient based on the maximum reward value.
PCT/CN2021/097139 2021-04-29 2021-05-31 Method and device for determining drug regimen of patient WO2022227198A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110474846.9A CN113255735B (en) 2021-04-29 2021-04-29 Method and device for determining medication scheme of patient
CN202110474846.9 2021-04-29

Publications (1)

Publication Number Publication Date
WO2022227198A1 true WO2022227198A1 (en) 2022-11-03

Family

ID=77223311

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097139 WO2022227198A1 (en) 2021-04-29 2021-05-31 Method and device for determining drug regimen of patient

Country Status (2)

Country Link
CN (1) CN113255735B (en)
WO (1) WO2022227198A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782192A (en) * 2021-09-30 2021-12-10 平安科技(深圳)有限公司 Grouping model construction method based on causal inference and medical data processing method
CN115658877B (en) * 2022-12-27 2023-03-21 神州医疗科技股份有限公司 Medicine recommendation method and device based on reinforcement learning, electronic equipment and medium
CN116205232B (en) * 2023-02-28 2023-09-01 之江实验室 Method, device, storage medium and equipment for determining target model
CN116779096B (en) * 2023-06-28 2024-04-16 南栖仙策(南京)高新技术有限公司 Medication policy determination method, device, equipment and storage medium
CN117275661B (en) * 2023-11-23 2024-02-09 太原理工大学 Deep reinforcement learning-based lung cancer patient medication prediction method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018211139A1 (en) * 2017-05-19 2018-11-22 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
US20200272905A1 (en) * 2019-02-26 2020-08-27 GE Precision Healthcare LLC Artificial neural network compression via iterative hybrid reinforcement learning approach
US11651841B2 (en) * 2019-05-15 2023-05-16 International Business Machines Corporation Drug compound identification for target tissue cells
CN112580801B (en) * 2020-12-09 2021-10-15 广州优策科技有限公司 Reinforced learning training method and decision-making method based on reinforced learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255443A (en) * 2018-08-07 2019-01-22 阿里巴巴集团控股有限公司 The method and device of training deeply learning model
CN111600851A (en) * 2020-04-27 2020-08-28 浙江工业大学 Feature filtering defense method for deep reinforcement learning model
CN111785366A (en) * 2020-06-29 2020-10-16 平安科技(深圳)有限公司 Method and device for determining patient treatment scheme and computer equipment
CN111816309A (en) * 2020-07-13 2020-10-23 国家康复辅具研究中心 Rehabilitation training prescription self-adaptive recommendation method and system based on deep reinforcement learning
CN112307726A (en) * 2020-11-09 2021-02-02 浙江大学 Automatic court opinion generation method guided by causal deviation removal model

Also Published As

Publication number Publication date
CN113255735B (en) 2024-04-09
CN113255735A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2022227198A1 (en) Method and device for determining drug regimen of patient
Wang et al. Methods for correcting inference based on outcomes predicted by machine learning
US11694109B2 (en) Data processing apparatus for accessing shared memory in processing structured data for modifying a parameter vector data structure
CN110135681B (en) Risk user identification method and device, readable storage medium and terminal equipment
US20190019582A1 (en) Systems and methods for predicting multiple health care outcomes
CN115082920B (en) Deep learning model training method, image processing method and device
CN111696661A (en) Patient clustering model construction method, patient clustering method and related equipment
WO2021174881A1 (en) Multi-dimensional information combination prediction method, apparatus, computer device, and medium
Seki et al. Machine learning-based prediction of in-hospital mortality using admission laboratory data: A retrospective, single-site study using electronic health record data
Rahaman Khan et al. Variable selection for accelerated lifetime models with synthesized estimation techniques
CN111967581B (en) Method, device, computer equipment and storage medium for interpreting grouping model
CN114298299A (en) Model training method, device, equipment and storage medium based on course learning
Paquette et al. Machine learning support for decision-making in kidney transplantation: step-by-step development of a technological solution
Bykova et al. Hidden Markov models for evolution and comparative genomics analysis
WO2023050668A1 (en) Clustering model construction method based on causal inference and medical data processing method
US20230161838A1 (en) Artificial intelligence model training that ensures equitable performance across sub-groups
CN114462522A (en) Lung cancer life cycle prediction model training and prediction method, system, device and medium
CN113627513A (en) Training data generation method and system, electronic device and storage medium
CN115516473A (en) Hybrid human-machine learning system
Wang et al. Adaptive treatment strategies for chronic conditions: shared-parameter G-estimation with an application to rheumatoid arthritis
Zhang et al. Doubly robust estimation of optimal dynamic treatment regimes with multicategory treatments and survival outcomes
Cheng et al. Extubation decision making with predictive information for mechanically ventilated patients in ICU
CN112509640B (en) Gene ontology item name generation method and device and storage medium
CN117436550B (en) Recommendation model training method and device
Wong et al. A new test for tail index with application to Danish fire loss data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938668

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938668

Country of ref document: EP

Kind code of ref document: A1