WO2022227198A1 - Method and device for determining drug regimen of patient - Google Patents

Method and device for determining drug regimen of patient Download PDF

Info

Publication number
WO2022227198A1
WO2022227198A1 (PCT/CN2021/097139)
Authority
WO
WIPO (PCT)
Prior art keywords
historical
data
state data
patient
unbiased
Prior art date
Application number
PCT/CN2021/097139
Other languages
French (fr)
Chinese (zh)
Inventor
徐卓扬
赵婷婷
孙行智
胡岗
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022227198A1 publication Critical patent/WO2022227198A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/10 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 70/00 - ICT specially adapted for the handling or processing of medical references
    • G16H 70/40 - ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Definitions

  • the present application relates to the field of digital medical technology, and in particular to a method and a device for determining a patient's medication regimen that enable precision medicine.
  • the deep reinforcement learning model needs historical sample data on the grouping and medication of a large number of patients. Since these historical decisions were usually made by doctors, they inevitably carry biases from personal experience and knowledge. When the deep reinforcement learning model estimates the value of different decisions in a specific state from such sample data, this bias skews the value estimates.
  • the purpose of this application is to provide a technical solution that can eliminate individual-specific biases in the process of determining a patient's medication regimen, so as to solve the above-mentioned problems in the prior art and thereby improve the intelligence and accuracy of the process of determining a patient's drug regimen.
  • the application provides a method for determining a patient's medication regimen, comprising the following steps:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application provides a device for determining a medication regimen for a patient, comprising a memory, a processor, and a program for determining the medication regimen for a patient that is stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the following method:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application provides a computer-readable storage medium, wherein the storage medium stores a program for determining a patient's medication regimen, and the following steps are implemented when the program for determining the patient's drug regimen is executed by a processor:
  • acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;
  • inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;
  • inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data;
  • determining a medication regimen for the target patient based on the maximum reward value.
  • the present application also provides a device for determining a patient's medication regimen, including:
  • an original state acquisition module which is suitable for acquiring original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient;
  • an unbiased processing module suitable for inputting the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
  • the deep learning module is suitable for inputting the unbiased state data into a deep reinforcement learning model to obtain the corresponding reward value when different medication regimens are taken for the target patient; wherein the reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data.
  • the regimen determination module is adapted to determine the medication regimen of the target patient based on the maximum reward value.
  • the present application introduces bias elimination from the field of causal inference into reinforcement-learning decision-making, optimizing the long-term cumulative return of decision selection while limiting the estimation error caused by selection bias, and improves the accuracy and safety of the model in practical use.
  • by introducing bias elimination and reinforcement learning into the method and device for determining a patient's medication regimen, the selection bias of the medication plan is eliminated and the estimation of the expected effect becomes more accurate, thereby enhancing the match between the medication plan and the patient and significantly improving the treatment effect.
  • Fig. 1 is a flow chart of Embodiment 1 of a method for determining a patient's medication regimen according to the present invention
  • FIG. 2 is a schematic structural diagram of an unbiased model according to Embodiment 1 of the present invention.
  • FIG. 3 is a schematic flowchart of training the unbiased model according to an embodiment of the present invention
  • FIG. 4 is a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention
  • FIG. 5 is a schematic diagram of an application of a deep reinforcement learning model according to Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram of a program module of Embodiment 1 of a device for determining a patient's medication regimen according to the present invention
  • FIG. 7 is a schematic diagram of the hardware structure of Embodiment 1 of the device for determining a medication regimen for a patient according to the present invention.
  • This embodiment proposes a method for determining a patient's medication regimen, and the determining method can be applied to a terminal or a server.
  • the terminals may include smart devices such as smart phones, notebook computers, and tablet computers, and the servers may include PCs, workgroup servers, and enterprise-level servers.
  • the determination method of this embodiment includes the following steps:
  • S100 Acquire original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient.
  • Deep reinforcement learning learns a mapping policy from states to actions: it learns the optimal mapping policy according to the reward value corresponding to each action, selects the optimal action according to that policy, obtains a delayed feedback value based on the state change caused by the optimal action, and iterates this loop until a termination condition is met.
  • the state refers to the original state data of the target patient
  • the action refers to a specific medication plan
  • the reward value refers to the expected feedback effect after taking a specific medication plan based on the state of the target patient.
  • the raw state data may include long-term medical follow-up records of the patient, such as demographic information, examination indicators, medication history, and other data recorded at each follow-up visit. For multiple records, a time-weighted sum over the different visits can be used to obtain an overall record.
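  • As an illustration of the time-weighted aggregation just described, the following sketch shows one way such a combination could be computed. It is not taken from the publication; the function name, feature layout and weights are invented for the example.

```python
import numpy as np

def aggregate_visits(visit_vectors, visit_weights):
    """Combine per-visit feature vectors into a single overall state record.

    visit_vectors: one equally sized 1-D array per follow-up visit
                   (e.g. demographics, lab indicators, medication-history codes).
    visit_weights: one weight per visit, e.g. larger for more recent visits.
    """
    vectors = np.asarray(visit_vectors, dtype=float)   # shape (n_visits, n_features)
    weights = np.asarray(visit_weights, dtype=float)
    weights = weights / weights.sum()                   # normalise so the weights sum to 1
    return weights @ vectors                            # time-weighted sum over the visits

# Three follow-up visits, with more recent visits weighted higher.
overall_record = aggregate_visits(
    visit_vectors=[[0.2, 1.0, 0.0], [0.3, 1.2, 1.0], [0.4, 1.1, 1.0]],
    visit_weights=[1.0, 2.0, 3.0],
)
```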
  • S200 Input the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated.
  • FIG. 2 is a schematic structural diagram of an unbiased model according to Embodiment 1 of the present invention.
  • the unbiased model includes an encoder, a decoder and a predictor: the encoder encodes the input original state data s to output unbiased state data E(s); the decoder decodes the unbiased state data E(s) to obtain analytical state data D(E(s)) corresponding to the original state data; and the predictor predicts, based on the input analytical state data D(E(s)), the reward value R(s, a) corresponding to taking different actions a (i.e., medication regimens).
  • the above encoder, decoder and predictor can all be implemented by a single-layer neural network.
  • the unbiased model provided by this embodiment, on the one hand, combines the encoder with the predictor so that the encoded unbiased state data E(s) is able to predict the reward value R(s, a); on the other hand, it combines the encoder with the decoder so that enough of the original input information is retained, thereby ensuring the accuracy of the prediction result. It can be understood that by constructing a suitable loss function to train the unbiased model, the model's tendency to select a specific action in a specific state can be influenced. The specific composition of the loss function is described in detail below.
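  • A minimal sketch of this encoder, decoder and predictor structure is given below, assuming a PyTorch implementation in which each component is a single linear layer as described above. The class name, dimensions and return convention are illustrative rather than part of the publication.

```python
import torch
import torch.nn as nn

class UnbiasedModel(nn.Module):
    """Encoder, decoder and predictor, each realised as a single-layer network."""

    def __init__(self, state_dim: int, code_dim: int, n_actions: int):
        super().__init__()
        self.encoder = nn.Linear(state_dim, code_dim)     # s -> E(s)
        self.decoder = nn.Linear(code_dim, state_dim)     # E(s) -> D(E(s))
        self.predictor = nn.Linear(state_dim, n_actions)  # D(E(s)) -> reward for each action a

    def forward(self, s: torch.Tensor):
        e = self.encoder(s)    # unbiased state data E(s)
        d = self.decoder(e)    # analytical (reconstructed) state data D(E(s))
        r = self.predictor(d)  # predicted reward R(s, a) for every candidate regimen
        return e, d, r
```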
  • S300 Input the unbiased state data into a deep reinforcement learning model to obtain the corresponding reward value when different medication regimens are adopted for the target patient.
  • the input and output data involved in the deep reinforcement learning model include states, actions and reward values.
  • the deep reinforcement learning model uses a neural network to fit a policy. Given an input state, the network applies the policy and outputs the expected reward value (reward) corresponding to each action; the action with the largest reward value is the best action that the deep reinforcement learning model considers should be chosen.
  • the state refers to the multi-dimensional encoding of the original state data of the target patient
  • the action refers to the multi-dimensional encoding of the medication regimen
  • the expected reward value refers to encoded data describing the feedback effect of taking a specific medication regimen for specific original state data.
  • the input state in this embodiment may be the unbiased state data E(s) output by the encoder of the unbiased model, which may specifically be multi-dimensional vector-encoded data composed of demographic information, examination indicators, and medication history; using the unbiased state data E(s) as the input state data of the deep reinforcement learning model eliminates the specificity in the state data and makes the output of the reinforcement learning model more accurate.
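  • The deep reinforcement learning model can therefore be pictured as a small value network that maps the encoded state E(s) to one expected reward per candidate regimen. The sketch below assumes a PyTorch implementation with an arbitrary hidden width; it is an illustration, not the network described in the publication.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fits the policy: maps an unbiased state encoding to one expected reward per regimen."""

    def __init__(self, code_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),  # one Q(E(s), a) value per candidate medication regimen
        )

    def forward(self, e_s: torch.Tensor) -> torch.Tensor:
        return self.net(e_s)               # expected reward for each action given E(s)
```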
  • S400 Determine the medication regimen of the target patient based on the maximum reward value.
  • a medication plan with the best expected therapeutic effect can be determined based on the patient's state data, so as to formulate a more appropriate treatment plan for different patients in a more targeted manner, and significantly improve the therapeutic effect.
  • the therapeutic effect of the drug regimen can be determined according to the size of the reward value, for example, the reward value with the largest value generally indicates the best therapeutic effect.
  • assuming that, for a patient's unbiased state data E0, the available drug regimens are A1, A2 and A3, and the deep reinforcement learning model outputs reward values R1, R2 and R3 for them respectively, then if R1 > R2 > R3, R1 is the reward value with the best therapeutic effect, and the medication plan A1 corresponding to R1 is the finally determined medication plan.
  • the unbiased model provided by this solution can remove, to the greatest extent, the bias present in the patient data while retaining the original information of the patient data, thereby ensuring the objectivity of the input to the deep reinforcement learning model and making its patient-classification output more accurate and fair.
  • FIG. 3 shows a schematic flowchart of training the unbiased model according to an embodiment of the present invention. As shown in FIG. 3, training the unbiased model includes the following steps:
  • S310 Acquire first historical sample data of multiple patients, where the first historical sample data includes first historical state data, first historical action data, and first historical reward data.
  • the above-mentioned first historical state data includes the patient's demographic information, examination and test indicators, and medication history; the first historical action data includes the medication plan prescribed by the doctor for the patient; and the first historical reward data includes the patient's health feedback information after taking the medication regimen.
  • S320 Use the first historical state data as the input of the encoder, and use the first historical reward data as the output of the predictor, to train the unbiased model and determine the weight parameters of the encoder, the decoder and the predictor.
  • the loss function Loss1 of the above unbiased model is determined by the following equation:
  • Loss1 = Lce + Linf + Lr;
  • where s represents the current first historical state data; E(s) represents the first historical unbiased state data output after s passes through the encoder; a represents the current first historical action data; A represents the set of all first historical action data; p(a) represents the probability of the current first historical action data among all the first historical action data; p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data; D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder; r represents the current first historical reward data; and R(E(s), a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  • Lce is a KL divergence loss function: by making the conditional probability of each action in the encoded space approach the proportion of that action in the whole sample, the tendency to select an action in the encoded space becomes independent of the input, which removes the action-selection bias.
  • the purpose of Linf is to allow the encoded space to retain enough original state information; the purpose of Lr is to enable the encoded space to have the ability to predict reward, that is, to add reward information to the encoded space.
  • using these three loss terms, the encoded unbiased state data E(s) removes the tendency to select actions in a specific state, while retaining sufficient original input information and reward prediction ability. In this way, using the unbiased state data E(s) as the input to the deep reinforcement learning model yields a more unbiased expected reward value.
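  • A hedged sketch of Loss1 is shown below. Only the composition Loss1 = Lce + Linf + Lr and the KL form of Lce are stated in the text; the publication renders the Linf and Lr formulas as images and does not say how p(a|E(s)) is obtained, so this sketch assumes an L2 reconstruction loss for Linf, an L2 reward-prediction loss for Lr, and an auxiliary softmax propensity head for p(a|E(s)).

```python
import torch
import torch.nn.functional as F

def unbiased_model_loss(s, a, r, model, propensity_head, p_a_marginal):
    """Sketch of Loss1 = Lce + Linf + Lr for one batch of first historical sample data.

    s: (batch, state_dim) first historical state data
    a: (batch,) index of the regimen actually prescribed (first historical action data)
    r: (batch,) observed reward (first historical reward data)
    p_a_marginal: (n_actions,) proportion of each action over the whole sample
    """
    e, d, r_pred_all = model(s)                                # E(s), D(E(s)), predicted rewards
    p_a_marginal = torch.as_tensor(p_a_marginal, dtype=torch.float32)

    # Lce: KL(p(a) || p(a|E(s))) pushes the action propensity in the encoded space
    # toward the marginal action distribution, i.e. makes it independent of the input.
    p_a_given_e = torch.softmax(propensity_head(e), dim=-1)    # assumed auxiliary head
    ratio = p_a_marginal.clamp_min(1e-8) / p_a_given_e.clamp_min(1e-8)
    lce = (p_a_marginal * torch.log(ratio)).sum(-1).mean()

    # Linf (assumed L2 form): keep enough of the original state information in the code.
    linf = F.mse_loss(d, s)

    # Lr (assumed L2 form): make the encoded space predictive of the observed reward.
    r_pred = r_pred_all.gather(1, a.unsqueeze(1)).squeeze(1)   # R(E(s), a) for the taken action
    lr = F.mse_loss(r_pred, r)

    return lce + linf + lr
```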
  • FIG. 4 shows a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention.
  • As shown in FIG. 4, training the deep reinforcement learning model includes the following steps:
  • S410 Acquire second historical sample data of multiple patients, where the second historical sample data includes second historical state data, second historical action data, and second historical reward data.
  • the second historical state data includes the patient's demographic information, examination and test indicators, and medication history;
  • the second historical action data includes the medication plan prescribed by the doctor for the patient; and
  • the second historical reward data includes the patient's health feedback information after taking the medication regimen.
  • the second historical reward data may include short-term reward data and long-term reward data, wherein the weight of the long-term reward value is higher than the weight of the short-term reward value.
  • the short-term reward data and long-term reward data here are determined according to the follow-up time. For example, it is stipulated that feedback information within one year belongs to short-term reward data, and feedback information of more than one year belongs to long-term reward data.
  • for patients, the long-term effect after treatment is clearly more important than the short-term effect, so this embodiment sets a higher weight for the long-term reward data; for example, the weight of the short-term reward data is set to 1 and the weight of the long-term reward data is set to 5, so that the second historical reward data better reflects the long-term effect.
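  • The text gives the example weights but not an explicit combination rule; a weighted sum is one natural reading, sketched below with the 1 and 5 from the example.

```python
def combined_reward(short_term_reward: float, long_term_reward: float,
                    w_short: float = 1.0, w_long: float = 5.0) -> float:
    """Second historical reward with long-term feedback weighted higher than short-term."""
    return w_short * short_term_reward + w_long * long_term_reward
```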
  • S420 Use the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data, the output second historical reward data is maximized.
  • the loss function Loss2 of the above-mentioned deep reinforcement learning model is determined by the following formula:
  • Loss2 = (Q(s_t, a_t) - (r_t + max_a(γ · Q(s_{t+1}, a))))^2;
  • where s_t represents the second historical state data at time t; a_t represents the second historical action data at time t; r_t represents the second historical reward data corresponding to taking the second historical action data a_t in the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when the second historical action data a is taken for the second historical state data at time t+1; and γ is a constant.
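  • The following sketch computes Loss2 over a batch in the usual Q-learning way. It assumes a PyTorch Q-network such as the one sketched earlier; the value of γ is illustrative, since the publication only says it is a constant.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, s_t, a_t, r_t, s_t1, gamma: float = 0.9):
    """Loss2 = (Q(s_t, a_t) - (r_t + max_a gamma*Q(s_{t+1}, a)))^2, averaged over the batch."""
    q_taken = q_net(s_t).gather(1, a_t.unsqueeze(1)).squeeze(1)   # Q(s_t, a_t)
    with torch.no_grad():                                         # the target is held fixed
        target = r_t + (gamma * q_net(s_t1)).max(dim=1).values    # r_t + max_a gamma*Q(s_{t+1}, a)
    return F.mse_loss(q_taken, target)
```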
  • FIG. 5 is a schematic diagram of an application of a deep reinforcement learning model according to Embodiment 1 of the present invention.
  • the deep learning model is connected to the encoder of the unbiased model; the unbiased state data E(s) output by the encoder serves as the input of the deep learning model, which finally outputs, for the same state s, the reward values corresponding to taking different actions a.
  • Q(s, a0), Q(s, a1) ... Q(s, an) in FIG. 5 respectively represent the reward values obtained by taking the different actions.
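  • Putting the pieces together, inference as pictured in FIG. 5 could look like the sketch below, reusing the hypothetical UnbiasedModel and QNetwork classes from the earlier examples; the regimen names are placeholders.

```python
import torch

def recommend_regimen(raw_state, unbiased_model, q_net, regimens):
    """Encode the raw state, score every candidate regimen, and pick the maximum reward.

    regimens: list of regimen identifiers, e.g. ["A1", "A2", "A3"].
    """
    with torch.no_grad():
        e_s, _, _ = unbiased_model(raw_state.unsqueeze(0))  # E(s): bias-removed state encoding
        q_values = q_net(e_s).squeeze(0)                    # Q(s, a0) ... Q(s, an)
    best = int(torch.argmax(q_values))                      # index of the maximum reward value
    return regimens[best], q_values
```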
  • the device 60 for determining a drug regimen for a patient may include, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to complete the present invention and implement the above-mentioned method for determining a patient's medication regimen.
  • the program module referred to in the present invention refers to a series of computer program instruction segments capable of accomplishing specific functions, and is more suitable for describing the execution process of the device 60 for determining a patient's medication regimen in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
  • the original state acquisition module 61 is suitable for acquiring the original state data of the target patient, and the original state data is used to characterize the disease characteristics of the patient;
  • an unbiased processing module 62 adapted to input the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
  • the deep learning module 63 is suitable for inputting the unbiased state data into a deep reinforcement learning model to obtain reward values corresponding to different medication regimens for the target patient; wherein a reward value is the expected feedback effect after taking the medication regimen, based on the unbiased state data;
  • the regimen determination module 64 is adapted to determine the medication regimen of the target patient based on the maximum reward value.
  • the device for determining a patient's medication plan provided in this embodiment eliminates the deviation of action selection through the unbiased processing module, so that the estimation of the expected reward is more accurate, thereby ensuring that the deep learning module fits a more reasonable expected reward value and thus improving the patient's therapeutic effect.
  • This embodiment also provides a computer device that can execute a program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server, or a server cluster composed of multiple servers), etc.
  • the computer device 70 in this embodiment at least includes but is not limited to: a memory 71 and a processor 72 that can be communicatively connected to each other through a system bus, as shown in FIG. 7 . It should be noted that FIG. 7 only shows a computer device 70 having components 71-72, but it should be understood that implementation of all of the illustrated components is not required, and more or fewer components may be implemented instead.
  • the memory 71 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, etc.
  • the memory 71 may be an internal storage unit of the computer device 70 , such as a hard disk or memory of the computer device 70 .
  • the memory 71 may also be an external storage device of the computer device 70, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc.
  • the memory 71 may also include both the internal storage unit of the computer device 70 and its external storage device.
  • the memory 71 is generally used to store the operating system and various application software installed on the computer device 70 , such as the program code of the device 60 for determining a patient's medication regimen in the first embodiment.
  • the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 72 is typically used to control the overall operation of the computer device 70 .
  • the processor 72 is configured to run program codes or process data stored in the memory 71 , for example, run the device 60 for determining a patient's medication regimen to implement the method for determining a patient's medication regimen in the first embodiment.
  • This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an application store, etc., on which a computer program is stored; when the program is executed by a processor, the corresponding function is realized.
  • the computer-readable storage medium of this embodiment is used to store the device 60 for determining a patient's medication regimen, and when executed by a processor, implements the method for determining a patient's medication plan of the first embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Medicinal Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Pathology (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

A method and device for determining a drug regimen of a patient, comprising the following steps: obtaining original state data of a target patient, the original state data being used for representing a condition feature of the patient (S100); inputting the original state data into an unbiased model to obtain unbiased state data in which the state distribution deviation is eliminated (S200); inputting the unbiased state data into a deep reinforcement learning model to obtain reward values when different drug regimens are used for the target patient (S300); and determining a drug regimen of the target patient on the basis of a maximum reward value (S400). By introducing deviation elimination and reinforcement learning into the method and device for determining a drug regimen of a patient, the selection deviation of a drug regimen is eliminated, such that the estimation of an expected reward is more accurate, thereby significantly enhancing the matching degree between a drug regimen and a patient.

Description

Method and device for determining a medication regimen for a patient

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to Chinese patent application No. 202110474846.9, titled "Method and device for determining a medication regimen for a patient" and filed on April 29, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD

The present application relates to the field of digital medical technology, and in particular to a method and a device for determining a patient's medication regimen that enable precision medicine.

BACKGROUND

Due to the specificity of each patient's physical condition, doctors often prescribe different medications for patients with the same type of disease in order to achieve the best therapeutic effect. Conventional practice divides patients into different groups according to certain strategies, so as to provide targeted medication regimens based on group characteristics. The inventors realized that the accuracy of this grouping directly affects the treatment effect. Because deep reinforcement learning methods can optimize long-term outcomes, they can be used to solve more and more sequential decision-making problems in real-world scenarios, and existing technologies already perform patient grouping through deep reinforcement learning.

A deep reinforcement learning model needs historical sample data on the grouping and medication of a large number of patients. Since these historical decisions were usually made by doctors, they inevitably carry biases from personal experience and knowledge. When the deep reinforcement learning model estimates the value of different decisions in a specific state from such sample data, this bias skews the value estimates.
SUMMARY OF THE INVENTION

The purpose of this application is to provide a technical solution that can eliminate individual-specific biases in the process of determining a patient's medication regimen, so as to solve the above-mentioned problems in the prior art and thereby improve the intelligence and accuracy of the process of determining a patient's medication regimen.

To achieve the above object, the present application provides a method for determining a patient's medication regimen, comprising the following steps:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application provides a device for determining a patient's medication regimen, comprising a memory, a processor, and a program for determining a patient's medication regimen that is stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the following method:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application provides a computer-readable storage medium, wherein the storage medium stores a program for determining a patient's medication regimen, and the following steps are implemented when the program is executed by a processor:

acquiring original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

inputting the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

determining the medication regimen of the target patient based on the maximum reward value.
To achieve the above object, the present application also provides a device for determining a patient's medication regimen, comprising:

an original state acquisition module, adapted to acquire original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient;

an unbiased processing module, adapted to input the original state data into an unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated;

a deep learning module, adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein a reward value is the expected feedback effect after the medication regimen is taken, based on the unbiased state data; and

a regimen determination module, adapted to determine the medication regimen of the target patient based on the maximum reward value.

The present application introduces bias elimination from the field of causal inference into reinforcement-learning decision-making, optimizing the long-term cumulative return of decision selection while limiting the estimation error caused by selection bias, which improves the accuracy and safety of the model in practical use. By introducing bias elimination and reinforcement learning into the method and device for determining a patient's medication regimen, the selection bias of the medication plan is eliminated and the estimation of the expected effect becomes more accurate, thereby enhancing the match between the medication plan and the patient and significantly improving the treatment effect.
BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of Embodiment 1 of the method for determining a patient's medication regimen according to the present invention;

FIG. 2 is a schematic structural diagram of the unbiased model according to Embodiment 1 of the present invention;

FIG. 3 is a schematic flowchart of training the unbiased model according to Embodiment 1 of the present invention;

FIG. 4 is a schematic flowchart of training the deep reinforcement learning model according to Embodiment 1 of the present invention;

FIG. 5 is a schematic diagram of an application of the deep reinforcement learning model according to Embodiment 1 of the present invention;

FIG. 6 is a schematic diagram of the program modules of Embodiment 1 of the device for determining a patient's medication regimen according to the present invention;

FIG. 7 is a schematic diagram of the hardware structure of Embodiment 1 of the device for determining a patient's medication regimen according to the present invention.
DETAILED DESCRIPTION

It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit it.

Embodiment 1

This embodiment proposes a method for determining a patient's medication regimen, which can be applied in a terminal or a server. The terminal may include smart devices such as smart phones, notebook computers and tablet computers, and the server may include PCs, workgroup servers and enterprise-level servers. Referring to FIG. 1, the determination method of this embodiment includes the following steps.

S100: Acquire original state data of the target patient, the original state data being used to characterize the disease characteristics of the patient.

Deep reinforcement learning learns a mapping policy from states to actions: it learns the optimal mapping policy according to the reward value corresponding to each action, selects the optimal action according to that policy, obtains a delayed feedback value based on the state change caused by the optimal action, and iterates this loop until a termination condition is met. In the embodiment of the present invention, the state refers to the original state data of the target patient, the action refers to a specific medication plan, and the reward value refers to the expected feedback effect after taking a specific medication plan based on the state of the target patient. The original state data may include long-term medical follow-up records of the patient, such as demographic information, examination indicators, medication history and other data recorded at each follow-up visit. For multiple records, a time-weighted sum over the different visits can be used to obtain an overall record.
S200: Input the original state data into the unbiased model to obtain unbiased state data from which the state distribution deviation is eliminated.

FIG. 2 is a schematic structural diagram of the unbiased model according to Embodiment 1 of the present invention. As shown in FIG. 2, the unbiased model includes an encoder, a decoder and a predictor: the encoder encodes the input original state data s to output unbiased state data E(s); the decoder decodes the unbiased state data E(s) to obtain analytical state data D(E(s)) corresponding to the original state data; and the predictor predicts, based on the input analytical state data D(E(s)), the reward value R(s, a) corresponding to taking different actions a (i.e., medication regimens). The encoder, decoder and predictor can all be implemented as single-layer neural networks.

On the one hand, by combining the encoder with the predictor, the unbiased model provided by this embodiment enables the encoded unbiased state data E(s) to predict the reward value R(s, a); on the other hand, by combining the encoder with the decoder, the unbiased model retains enough of the original input information, thereby ensuring the accuracy of the prediction result. It can be understood that by constructing a suitable loss function to train the unbiased model, the model's tendency to select a specific action in a specific state can be influenced. The specific composition of the loss function is described in detail below.
S300: Input the unbiased state data into the deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient.

Those skilled in the art understand that the input and output data of the deep reinforcement learning model include states, actions and reward values. The deep reinforcement learning model uses a neural network to fit a policy: given an input state, the network applies the policy and outputs the expected reward value corresponding to each action, and the action with the largest reward value is the best action that the model considers should be chosen. In this embodiment, the state refers to the multi-dimensional encoding of the original state data of the target patient, the action refers to the multi-dimensional encoding of the medication regimen, and the expected reward value refers to encoded data describing the feedback effect of taking a specific medication regimen for specific original state data. It should be noted that the input state in this embodiment may be the unbiased state data E(s) output by the encoder of the unbiased model, which may specifically be multi-dimensional vector-encoded data composed of demographic information, examination indicators and medication history; using the unbiased state data E(s) as the input state data of the deep reinforcement learning model eliminates the specificity in the state data and makes the output of the reinforcement learning model more accurate.

S400: Determine the medication regimen of the target patient based on the maximum reward value.

This step can determine the medication plan with the best expected therapeutic effect based on the patient's state data, so as to formulate more appropriate treatment plans for different patients in a more targeted manner and significantly improve the therapeutic effect. In this embodiment, the therapeutic effect of a medication regimen can be judged by the size of the reward value; for example, the largest reward value generally indicates the best therapeutic effect. Assuming that, for a patient's unbiased state data E0, the available medication regimens are A1, A2 and A3 and the deep reinforcement learning model outputs reward values R1, R2 and R3 for them respectively, then if R1 > R2 > R3, R1 is the reward value with the best therapeutic effect, and the medication plan A1 corresponding to R1 is the finally determined medication plan.

Through the above steps, the unbiased model provided by this solution can remove, to the greatest extent, the bias present in the patient data while retaining the original information of the patient data, thereby ensuring the objectivity of the input to the deep reinforcement learning model and making its patient-classification output more accurate and fair.
FIG. 3 shows a schematic flowchart of training the unbiased model according to an embodiment of the present invention. As shown in FIG. 3, training the unbiased model includes the following steps.

S310: Acquire first historical sample data of multiple patients, the first historical sample data including first historical state data, first historical action data and first historical reward data.

The first historical state data includes the patient's demographic information, examination and test indicators, and medication history; the first historical action data includes the medication plan prescribed by the doctor for the patient; and the first historical reward data includes the patient's health feedback information after taking the medication regimen.

S320: Use the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to train the unbiased model, so as to determine the weight parameters of the encoder, the decoder and the predictor.

S330: When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.

In one example, the loss function Loss1 of the above unbiased model is determined by the following equations:
Loss1 = Lce + Linf + Lr;

Lce = ∑_{a∈A} p(a) · log[p(a) / p(a|E(s))];

(The Linf and Lr terms are defined by the formulas rendered as images PCTCN2021097139-appb-000001 and PCTCN2021097139-appb-000002 in the original publication.)

where s represents the current first historical state data; E(s) represents the first historical unbiased state data output after s passes through the encoder; a represents the current first historical action data; A represents the set of all first historical action data; p(a) represents the probability of the current first historical action data among all the first historical action data; p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data; D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder; the symbol rendered as image PCTCN2021097139-appb-000003 denotes the L2 regularization of its argument x; r represents the current first historical reward data; and R(E(s), a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
FIG. 4 shows a schematic flowchart of training the deep reinforcement learning model according to an embodiment of the present invention. As shown in FIG. 4, training the deep reinforcement learning model includes the following steps.

S410: Acquire second historical sample data of multiple patients, the second historical sample data including second historical state data, second historical action data and second historical reward data.

The second historical state data includes the patient's demographic information, examination and test indicators, and medication history; the second historical action data includes the medication plan prescribed by the doctor for the patient; and the second historical reward data includes the patient's health feedback information after taking the medication regimen. Specifically, the second historical reward data may include short-term reward data and long-term reward data, where the weight of the long-term reward value is higher than the weight of the short-term reward value. Short-term and long-term reward data are distinguished according to the follow-up time; for example, feedback information within one year is taken as short-term reward data and feedback information beyond one year as long-term reward data. For patients, the long-term effect after treatment is clearly more important than the short-term effect, so this embodiment sets a higher weight for the long-term reward data; for example, the weight of the short-term reward data is set to 1 and the weight of the long-term reward data is set to 5, so that the second historical reward data better reflects the long-term effect.

S420: Use the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data, the output second historical reward data is maximized.

S430: When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.

In one example, the loss function Loss2 of the above deep reinforcement learning model is determined by the following formula:
Loss2 = (Q(s_t, a_t) - (r_t + max_a(γ · Q(s_{t+1}, a))))^2;

where s_t represents the second historical state data at time t; a_t represents the second historical action data at time t; r_t represents the second historical reward data corresponding to taking the second historical action data a_t in the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when the second historical action data a is taken for the second historical state data at time t+1; and γ is a constant.
FIG. 5 is a schematic diagram of an application of the deep reinforcement learning model according to Embodiment 1 of the present invention. As shown in FIG. 5, the deep learning model is connected to the encoder of the unbiased model; the unbiased state data E(s) output by the encoder serves as the input of the deep learning model, which finally outputs, for the same state s, the reward values corresponding to taking different actions a. Q(s, a0), Q(s, a1) ... Q(s, an) in FIG. 5 respectively represent the reward values obtained by taking the different actions.
请继续参阅图6,示出了一种患者用药方案的确定装置,在本实施例中,患者用药方案的确定装置60可以包括或被分割成一个或多个程序模块,一个或者多个程序模块被存储于存储介质中,并由一个或多个处理器所执行,以完成本发明,并可实现上述患者用药方案的确定方法。本发明所称的程序模块是指能够完成特定功能的一系列计算机程序指令 段,比程序本身更适合于描述患者用药方案的确定装置60在存储介质中的执行过程。以下描述将具体介绍本实施例各程序模块的功能:Please continue to refer to FIG. 6 , which shows a device for determining a medication regimen of a patient. In this embodiment, the device 60 for determining a drug regimen for a patient may include or be divided into one or more program modules, one or more program modules is stored in a storage medium and executed by one or more processors to complete the present invention and implement the above-mentioned method for determining a patient's medication regimen. The program module referred to in the present invention refers to a series of computer program instruction segments capable of accomplishing specific functions, and is more suitable for describing the execution process of the device 60 for determining a patient's medication regimen in the storage medium than the program itself. The following description will specifically introduce the functions of each program module in this embodiment:
原始状态获取模块61,适用于获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;The original state acquisition module 61 is suitable for acquiring the original state data of the target patient, and the original state data is used to characterize the disease characteristics of the patient;
无偏处理模块62,适用于将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;an unbiased processing module 62, adapted to input the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
深度学习模块63，适用于将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；The deep learning module 63 is adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
方案确定模块64,适用于基于最大的奖励值确定所述目标患者的用药方案。The regimen determination module 64 is adapted to determine the medication regimen of the target patient based on the maximum reward value.
本实施例提供的患者用药方案的确定装置，通过无偏处理模块消除了动作选择的偏差，使预期奖励的估计更加准确，从而确保深度学习模块拟合得到更加合理的预期奖励值，从而改善患者的治疗效果。The device for determining a patient's medication regimen provided in this embodiment eliminates the bias in action selection through the unbiased processing module, making the estimation of the expected reward more accurate, thereby ensuring that the deep learning module fits a more reasonable expected reward value and thus improving the patient's therapeutic effect.
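A schematic composition of modules 61-64, assuming the hypothetical encoder and Q-network sketches given earlier (illustrative only; the class and method names are not from the original disclosure):

```python
class MedicationRegimenDevice:
    """Illustrative wrapper mirroring modules 61-64 of the determination device 60."""
    def __init__(self, encoder, q_net, regimens):
        self.encoder = encoder      # used by the unbiased processing module 62
        self.q_net = q_net          # used by the deep learning module 63
        self.regimens = regimens    # list of candidate medication regimens

    def determine(self, raw_state):
        # module 61: raw_state is assumed to have been acquired for the target patient
        unbiased = self.encoder(raw_state)        # module 62: remove the state distribution bias
        rewards = self.q_net(unbiased)            # module 63: reward value for each regimen
        best = int(rewards.argmax())              # module 64: pick the maximum reward value
        return self.regimens[best]
```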
本实施例还提供一种计算机设备，如可以执行程序的智能手机、平板电脑、笔记本电脑、台式计算机、机架式服务器、刀片式服务器、塔式服务器或机柜式服务器（包括独立的服务器，或者多个服务器所组成的服务器集群）等。本实施例的计算机设备70至少包括但不限于：可通过系统总线相互通信连接的存储器71、处理器72，如图7所示。需要指出的是，图7仅示出了具有组件71-72的计算机设备70，但是应理解的是，并不要求实施所有示出的组件，可以替代的实施更多或者更少的组件。This embodiment also provides a computer device capable of executing a program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server, or a server cluster composed of multiple servers), etc. The computer device 70 in this embodiment at least includes, but is not limited to, a memory 71 and a processor 72 that can be communicatively connected to each other through a system bus, as shown in FIG. 7. It should be noted that FIG. 7 only shows the computer device 70 with components 71-72, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead.
本实施例中,存储器71(即可读存储介质)包括闪存、硬盘、多媒体卡、卡型存储器(例如,SD或DX存储器等)、随机访问存储器(RAM)、静态随机访问存储器(SRAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、可编程只读存储器(PROM)、磁性存储器、磁盘、光盘等。在一些实施例中,存储器71可以是计算机设备70的内部存储单元,例如该计算机设备70的硬盘或内存。在另一些实施例中,存储器71也可以是计算机设备70的外部存储设备,例如该计算机设备70上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。当然,存储器71还可以既包括计算机设备70的内部存储单元也包括其外部存储设备。本实施例中,存储器71通常用于存储安装于计算机设备70的操作系统和各类应用软件,例如实施例一的患者用药方案的确定装置60的程序代码等。此外,存储器71还可以用于暂时地存储已经输出或者将要输出的各类数据。In this embodiment, the memory 71 (ie, a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (eg, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), Magnetic Memory, Magnetic Disk, Optical Disk, etc. In some embodiments, the memory 71 may be an internal storage unit of the computer device 70 , such as a hard disk or memory of the computer device 70 . In other embodiments, the memory 71 may also be an external storage device of the computer device 70, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Of course, the memory 71 may also include both the internal storage unit of the computer device 70 and its external storage device. In this embodiment, the memory 71 is generally used to store the operating system and various application software installed on the computer device 70 , such as the program code of the device 60 for determining a patient's medication regimen in the first embodiment. In addition, the memory 71 can also be used to temporarily store various types of data that have been output or will be output.
处理器72在一些实施例中可以是中央处理器(Central Processing Unit,CPU)、控制器、微控制器、微处理器、或其他数据处理芯片。该处理器72通常用于控制计算机设备70的总体操作。本实施例中,处理器72用于运行存储器71中存储的程序代码或者处理数据,例如运行患者用药方案的确定装置60,以实现实施例一的患者用药方案的确定方法。The processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. The processor 72 is typically used to control the overall operation of the computer device 70 . In this embodiment, the processor 72 is configured to run program codes or process data stored in the memory 71 , for example, run the device 60 for determining a patient's medication regimen to implement the method for determining a patient's medication regimen in the first embodiment.
本实施例还提供一种计算机可读存储介质，如闪存、硬盘、多媒体卡、卡型存储器（例如，SD或DX存储器等）、随机访问存储器（RAM）、静态随机访问存储器（SRAM）、只读存储器（ROM）、电可擦除可编程只读存储器（EEPROM）、可编程只读存储器（PROM）、磁性存储器、磁盘、光盘、服务器、App应用商城等等，其上存储有计算机程序，程序被处理器执行时实现相应功能。本实施例的计算机可读存储介质用于存储患者用药方案的确定装置60，被处理器执行时实现实施例一的患者用药方案的确定方法。This embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disk, a server, an app store, and so on, on which a computer program is stored; when the program is executed by a processor, the corresponding function is implemented. The computer-readable storage medium of this embodiment is used to store the device 60 for determining a patient's medication regimen, and when executed by a processor, implements the method for determining a patient's medication regimen of the first embodiment.
需要说明的是，在本文中，术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下，由语句“包括一个……”限定的要素，并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。It should be noted that, herein, the terms "include", "comprise", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "comprising a..." does not exclude the presence of additional identical elements in the process, method, article or system that includes the element.
上述本申请实施例序号仅仅为了描述，不代表实施例的优劣。在列举了若干装置的单元权利要求中，这些装置中的若干个可以是通过同一个硬件项来具体体现。词语第一、第二、以及第三等的使用不表示任何顺序，可将这些词语解释为标识。The above serial numbers of the embodiments of the present application are for description only and do not represent the advantages or disadvantages of the embodiments. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order; these words may be construed as identifiers.
通过以上的实施方式的描述，本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现，当然也可以通过硬件，但很多情况下前者是更佳的实施方式。基于这样的理解，本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来，该计算机软件产品存储在一个存储介质（如只读存储器镜像（Read Only Memory image，ROM）/随机存取存储器（Random Access Memory，RAM）、磁碟、光盘）中，包括若干指令用以使得一台终端设备（可以是手机，计算机，服务器，空调器，或者网络设备等）执行本申请各个实施例所述的方法。From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a read-only memory image (ROM)/random access memory (RAM), a magnetic disk, or an optical disk) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the methods described in the various embodiments of the present application.
以上仅为本申请的优选实施例，并非因此限制本申请的专利范围，凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本申请的专利保护范围内。The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structural or process transformation made using the contents of the specification and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. 一种患者用药方案的确定方法,其中,包括以下步骤:A method for determining a patient's medication regimen, comprising the following steps:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  2. 根据权利要求1所述的患者用药方案的确定方法，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The method for determining a patient's medication regimen according to claim 1, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  3. 根据权利要求2所述的患者用药方案的确定方法,其中,所述无偏模型的训练过程包括以下步骤:The method for determining a patient's medication regimen according to claim 2, wherein the training process of the unbiased model comprises the following steps:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  4. 根据权利要求3所述的患者用药方案的确定方法,其中,所述无偏模型的损失函数Loss1由以下算式确定:The method for determining a patient's medication regimen according to claim 3, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100001 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100002 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100003]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100003 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  5. 根据权利要求1所述的患者用药方案的确定方法,其中,所述深度强化学习模型的训练过程,包括以下步骤:The method for determining a patient's medication regimen according to claim 1, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  6. 根据权利要求5所述的患者用药方案的确定方法,其中,所述深度强化学习模型的损失函数Loss2由以下算式确定:The method for determining a patient's medication regimen according to claim 5, wherein the loss function Loss2 of the deep reinforcement learning model is determined by the following formula:
    Loss2 = (Q(s_t, a_t) - (r_t + max(γ × Q(s_{t+1}, a))))²;
    上式中，s_t表示在t时刻的第二历史状态数据，a_t表示在t时刻的第二历史动作数据，r_t表示对第二历史状态数据s_t采取第二历史动作数据a_t对应的第二历史奖励数据；Q(s_{t+1},a)表示对t+1时刻的第二历史状态数据采取第二历史动作数据a时得到的第二历史奖励数据，γ为常数。In the above formula, s_t represents the second historical state data at time t, a_t represents the second historical action data at time t, and r_t represents the second historical reward data corresponding to taking the second historical action data a_t for the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when second historical action data a is taken for the second historical state data at time t+1, and γ is a constant.
  7. 根据权利要求5所述的患者用药方案的确定方法,其中,所述第二历史奖励数据包括短期奖励数据和长期奖励数据,所述长期奖励数据的权重高于所述短期奖励数据的权重。The method for determining a patient's medication regimen according to claim 5, wherein the second historical reward data includes short-term reward data and long-term reward data, and the weight of the long-term reward data is higher than the weight of the short-term reward data.
  8. 一种患者用药方案的确定设备，包括存储器、处理器以及存储在存储器上并可在处理器上运行的患者用药方案的确定程序，其中，所述处理器执行所述患者用药方案的确定程序时实现以下方法的步骤：A device for determining a patient's medication regimen, comprising a memory, a processor, and a program for determining a patient's medication regimen that is stored on the memory and executable on the processor, wherein the processor, when executing the program for determining a patient's medication regimen, implements the steps of the following method:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  9. 根据权利要求8所述的患者用药方案的确定设备，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The device for determining a patient's medication regimen according to claim 8, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  10. 根据权利要求9所述的患者用药方案的确定设备,其中,所述无偏模型的训练过程包括以下步骤:The device for determining a patient's medication regimen according to claim 9, wherein the training process of the unbiased model comprises the following steps:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  11. 根据权利要求10所述的患者用药方案的确定设备,其中,所述无偏模型的损失函数Loss1由以下算式确定:The device for determining a patient's medication regimen according to claim 10, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100004 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100005 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100006]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100006 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  12. 根据权利要求8所述的患者用药方案的确定设备,其中,所述深度强化学习模型的训练过程,包括以下步骤:The device for determining a patient's medication regimen according to claim 8, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  13. 根据权利要求12所述的患者用药方案的确定设备,其中,所述深度强化学习模型的损失函数Loss2由以下算式确定:The device for determining a patient's medication regimen according to claim 12, wherein the loss function Loss2 of the deep reinforcement learning model is determined by the following formula:
    Loss2 = (Q(s_t, a_t) - (r_t + max(γ × Q(s_{t+1}, a))))²;
    上式中，s_t表示在t时刻的第二历史状态数据，a_t表示在t时刻的第二历史动作数据，r_t表示对第二历史状态数据s_t采取第二历史动作数据a_t对应的第二历史奖励数据；Q(s_{t+1},a)表示对t+1时刻的第二历史状态数据采取第二历史动作数据a时得到的第二历史奖励数据，γ为常数。In the above formula, s_t represents the second historical state data at time t, a_t represents the second historical action data at time t, and r_t represents the second historical reward data corresponding to taking the second historical action data a_t for the second historical state data s_t; Q(s_{t+1}, a) represents the second historical reward data obtained when second historical action data a is taken for the second historical state data at time t+1, and γ is a constant.
  14. 根据权利要求12所述的患者用药方案的确定设备,其中,所述第二历史奖励数据包括短期奖励数据和长期奖励数据,所述长期奖励数据的权重高于所述短期奖励数据的权重。The device for determining a patient's medication regimen according to claim 12, wherein the second historical reward data includes short-term reward data and long-term reward data, and the weight of the long-term reward data is higher than the weight of the short-term reward data.
  15. 一种计算机可读存储介质,其中,所述存储介质上存储有患者用药方案的确定程序,所述患者用药方案的确定程序被处理器执行时实现以下步骤:A computer-readable storage medium, wherein a program for determining a patient's medication regimen is stored on the storage medium, and when the program for determining a patient's medication regimen is executed by a processor, the following steps are implemented:
    获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;obtaining raw state data of the target patient, the raw state data being used to characterize the disease characteristics of the patient;
    将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;Inputting the original state data into an unbiased model to obtain unbiased state data with the state distribution deviation removed;
    将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；inputting the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    基于最大的奖励值确定所述目标患者的用药方案。A medication regimen for the target patient is determined based on the maximum reward value.
  16. 根据权利要求15所述的存储介质，其中，所述无偏模型包括编码器、解码器和预测器，所述编码器用于对所述原始状态数据进行编码以输出无偏状态数据，所述解码器用于对所述无偏状态数据进行解码以得到与所述原始状态数据对应的解析状态数据，所述预测器基于所述解析状态数据，预测采取不同的用药方案时对应的奖励值；其中，所述编码器、所述解码器和所述预测器均为单层神经网络。The storage medium according to claim 15, wherein the unbiased model includes an encoder, a decoder and a predictor; the encoder is used to encode the original state data to output unbiased state data, the decoder is used to decode the unbiased state data to obtain analytical state data corresponding to the original state data, and the predictor predicts, based on the analytical state data, the corresponding reward values when different medication regimens are taken; wherein the encoder, the decoder and the predictor are all single-layer neural networks.
  17. 根据权利要求16所述的存储介质,其中,所述无偏模型的训练过程包括以下步骤:The storage medium of claim 16, wherein the training process of the unbiased model comprises the steps of:
    获取多个患者的第一历史样本数据，所述第一历史样本数据包括第一历史状态数据、第一历史动作数据和第一历史奖励数据；其中，所述第一历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第一历史动作数据包括医生针对所述患者开具的用药方案，所述第一历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring first historical sample data of a plurality of patients, the first historical sample data including first historical state data, first historical action data and first historical reward data; wherein the first historical state data includes the patients' demographic information, examination and test indicators, and medication history; the first historical action data includes medication regimens prescribed by doctors for the patients, and the first historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第一历史状态数据作为所述编码器的输入,将所述第一历史奖励数据作为所述预测器的输出训练所述无偏模型,以确定所述编码器、所述解码器和所述预测器中的权重参数;training the unbiased model using the first historical state data as the input of the encoder and the first historical reward data as the output of the predictor to determine the encoder, the decoder and the weight parameters in the predictor;
    当所述无偏模型的损失函数收敛于预设阈值时,所述无偏模型的训练过程结束。When the loss function of the unbiased model converges to a preset threshold, the training process of the unbiased model ends.
  18. 根据权利要求17所述的存储介质,其中,所述无偏模型的损失函数Loss1由以下算式确定:The storage medium according to claim 17, wherein the loss function Loss1 of the unbiased model is determined by the following formula:
    Loss1 = Lce + Linf + Lr;
    Lce = ∑_{a∈A} p(a)*log[p(a)/p(a|E(s))];
    Linf: [formula rendered as image PCTCN2021097139-appb-100007 in the original publication];
    Lr: [formula rendered as image PCTCN2021097139-appb-100008 in the original publication];
    其中，s代表当前第一历史状态数据，E(s)代表s经过编码器后输出的第一历史无偏状态数据，a表示当前第一历史动作数据，A表示所有第一历史动作数据的集合，p(a)表示在所有第一历史动作数据中选择当前第一历史动作数据的概率，p(a|E(s))表示在当前第一历史无偏状态数据下采取当前第一历史动作数据的概率，D(E(s))表示第一历史无偏状态数据经过解码器后输出的第一历史解析状态数据，[图PCTCN2021097139-appb-100009]表示对x的L2正则化，r表示当前第一历史奖励数据，R(E(s),a)表示在当前第一历史无偏状态数据下采取当前第一历史动作数据对应的第一历史奖励数据。Where s represents the current first historical state data, E(s) represents the first historical unbiased state data output after s passes through the encoder, a represents the current first historical action data, A represents the set of all first historical action data, p(a) represents the probability of selecting the current first historical action data among all first historical action data, p(a|E(s)) represents the probability of taking the current first historical action data given the current first historical unbiased state data, D(E(s)) represents the first historical analytical state data output after the first historical unbiased state data passes through the decoder, the notation shown in image PCTCN2021097139-appb-100009 represents the L2 regularization of x, r represents the current first historical reward data, and R(E(s),a) represents the first historical reward data corresponding to taking the current first historical action data under the current first historical unbiased state data.
  19. 根据权利要求15所述的存储介质,其中,所述深度强化学习模型的训练过程,包括以下步骤:The storage medium according to claim 15, wherein the training process of the deep reinforcement learning model comprises the following steps:
    获取多个患者的第二历史样本数据，所述第二历史样本数据包括第二历史状态数据、第二历史动作数据和第二历史奖励数据；其中，所述第二历史状态数据包括所述患者的人口统计学信息、检验检查指标和用药史；所述第二历史动作数据包括医生针对所述患者开具的用药方案，所述第二历史奖励数据包括所述患者采取所述用药方案后的健康反馈信息；acquiring second historical sample data of a plurality of patients, the second historical sample data including second historical state data, second historical action data and second historical reward data; wherein the second historical state data includes the patients' demographic information, examination and test indicators, and medication history; the second historical action data includes medication regimens prescribed by doctors for the patients, and the second historical reward data includes the patients' health feedback information after taking the medication regimens;
    将所述第二历史状态数据作为输入，将所述第二历史奖励数据作为输出以训练所述深度强化学习模型中的策略函数，以使所述深度强化学习模型基于所述第二历史状态数据通过所述策略函数选择对应的第二历史动作数据时输出的所述第二历史奖励数据最大；using the second historical state data as input and the second historical reward data as output to train the policy function in the deep reinforcement learning model, so that the second historical reward data output when the deep reinforcement learning model selects the corresponding second historical action data through the policy function based on the second historical state data is maximized;
    当所述深度强化学习模型的损失函数收敛于预设阈值时,所述训练过程结束。When the loss function of the deep reinforcement learning model converges to a preset threshold, the training process ends.
  20. 一种患者用药方案的确定装置,其中,包括:A device for determining a patient's medication regimen, comprising:
    原始状态获取模块,适用于获取目标患者的原始状态数据,所述原始状态数据用于表征所述患者的病情特征;an original state acquisition module, which is suitable for acquiring original state data of the target patient, where the original state data is used to characterize the disease characteristics of the patient;
    无偏处理模块,适用于将所述原始状态数据输入无偏模型,以得到消除了状态分布偏差的无偏状态数据;an unbiased processing module, suitable for inputting the original state data into an unbiased model, so as to obtain unbiased state data from which the state distribution deviation is eliminated;
    深度学习模块，适用于将所述无偏状态数据输入深度强化学习模型，获取对所述目标患者采取不同的用药方案时对应的奖励值；其中所述奖励值是基于所述无偏状态数据采取所述用药方案后的预期反馈效果；a deep learning module, adapted to input the unbiased state data into a deep reinforcement learning model to obtain the reward values corresponding to different medication regimens for the target patient, wherein the reward value is the expected feedback effect of taking the medication regimen, determined based on the unbiased state data;
    方案确定模块,适用于基于最大的奖励值确定所述目标患者的用药方案。The regimen determination module is adapted to determine the medication regimen of the target patient based on the maximum reward value.
PCT/CN2021/097139 2021-04-29 2021-05-31 Method and device for determining drug regimen of patient WO2022227198A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110474846.9A CN113255735B (en) 2021-04-29 2021-04-29 Method and device for determining medication scheme of patient
CN202110474846.9 2021-04-29

Publications (1)

Publication Number Publication Date
WO2022227198A1 true WO2022227198A1 (en) 2022-11-03

Family

ID=77223311

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097139 WO2022227198A1 (en) 2021-04-29 2021-05-31 Method and device for determining drug regimen of patient

Country Status (2)

Country Link
CN (1) CN113255735B (en)
WO (1) WO2022227198A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113782192A (en) * 2021-09-30 2021-12-10 平安科技(深圳)有限公司 Grouping model construction method based on causal inference and medical data processing method
CN115658877B (en) * 2022-12-27 2023-03-21 神州医疗科技股份有限公司 Medicine recommendation method and device based on reinforcement learning, electronic equipment and medium
CN116205232B (en) * 2023-02-28 2023-09-01 之江实验室 Method, device, storage medium and equipment for determining target model
CN116779096B (en) * 2023-06-28 2024-04-16 南栖仙策(南京)高新技术有限公司 Medication policy determination method, device, equipment and storage medium
CN117275661B (en) * 2023-11-23 2024-02-09 太原理工大学 Deep reinforcement learning-based lung cancer patient medication prediction method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018211139A1 (en) * 2017-05-19 2018-11-22 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
US20200272905A1 (en) * 2019-02-26 2020-08-27 GE Precision Healthcare LLC Artificial neural network compression via iterative hybrid reinforcement learning approach
US11651841B2 (en) * 2019-05-15 2023-05-16 International Business Machines Corporation Drug compound identification for target tissue cells
CN112580801B (en) * 2020-12-09 2021-10-15 广州优策科技有限公司 Reinforced learning training method and decision-making method based on reinforced learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109255443A (en) * 2018-08-07 2019-01-22 阿里巴巴集团控股有限公司 The method and device of training deeply learning model
CN111600851A (en) * 2020-04-27 2020-08-28 浙江工业大学 Feature filtering defense method for deep reinforcement learning model
CN111785366A (en) * 2020-06-29 2020-10-16 平安科技(深圳)有限公司 Method and device for determining patient treatment scheme and computer equipment
CN111816309A (en) * 2020-07-13 2020-10-23 国家康复辅具研究中心 Rehabilitation training prescription self-adaptive recommendation method and system based on deep reinforcement learning
CN112307726A (en) * 2020-11-09 2021-02-02 浙江大学 Automatic court opinion generation method guided by causal deviation removal model

Also Published As

Publication number Publication date
CN113255735B (en) 2024-04-09
CN113255735A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
WO2022227198A1 (en) Method and device for determining drug regimen of patient
Wang et al. Methods for correcting inference based on outcomes predicted by machine learning
US11694109B2 (en) Data processing apparatus for accessing shared memory in processing structured data for modifying a parameter vector data structure
CN110135681B (en) Risk user identification method and device, readable storage medium and terminal equipment
US20190019582A1 (en) Systems and methods for predicting multiple health care outcomes
CN115082920B (en) Deep learning model training method, image processing method and device
CN111696661A (en) Patient clustering model construction method, patient clustering method and related equipment
WO2021174881A1 (en) Multi-dimensional information combination prediction method, apparatus, computer device, and medium
Seki et al. Machine learning-based prediction of in-hospital mortality using admission laboratory data: A retrospective, single-site study using electronic health record data
Rahaman Khan et al. Variable selection for accelerated lifetime models with synthesized estimation techniques
CN111967581B (en) Method, device, computer equipment and storage medium for interpreting grouping model
CN114298299A (en) Model training method, device, equipment and storage medium based on course learning
Paquette et al. Machine learning support for decision-making in kidney transplantation: step-by-step development of a technological solution
Bykova et al. Hidden Markov models for evolution and comparative genomics analysis
WO2023050668A1 (en) Clustering model construction method based on causal inference and medical data processing method
US20230161838A1 (en) Artificial intelligence model training that ensures equitable performance across sub-groups
CN114462522A (en) Lung cancer life cycle prediction model training and prediction method, system, device and medium
CN113627513A (en) Training data generation method and system, electronic device and storage medium
CN115516473A (en) Hybrid human-machine learning system
Wang et al. Adaptive treatment strategies for chronic conditions: shared-parameter G-estimation with an application to rheumatoid arthritis
Zhang et al. Doubly robust estimation of optimal dynamic treatment regimes with multicategory treatments and survival outcomes
Cheng et al. Extubation decision making with predictive information for mechanically ventilated patients in ICU
CN112509640B (en) Gene ontology item name generation method and device and storage medium
CN117436550B (en) Recommendation model training method and device
Wong et al. A new test for tail index with application to Danish fire loss data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938668

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938668

Country of ref document: EP

Kind code of ref document: A1