US20230376774A1 - Adversarial Cooperative Imitation Learning for Dynamic Treatment

Info

Publication number
US20230376774A1
US 20230376774 A1 (U.S. application Ser. No. 18/362,166)
Authority
US
United States
Prior art keywords
trajectories
model
discriminator
resulted
negative
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
US18/362,166
Inventor
Wenchao Yu
Haifeng Chen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Laboratories America Inc
Original Assignee
NEC Laboratories America Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Laboratories America Inc filed Critical NEC Laboratories America Inc
Priority to US18/362,166 priority Critical patent/US20230376774A1/en
Assigned to NEC LABORATORIES AMERICA, INC. reassignment NEC LABORATORIES AMERICA, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHEN, HAIFENG, YU, Wenchao
Publication of US20230376774A1 publication Critical patent/US20230376774A1/en
Pending legal-status Critical Current

Classifications

    • G16H 50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining for computer-aided diagnosis, e.g. based on medical expert systems
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 20/20: Ensemble learning
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G06N 5/046: Forward inferencing; Production systems
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks
    • G16H 20/00: ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 50/30: ICT specially adapted for medical diagnosis, medical simulation or medical data mining for calculating health indices; for individual health risk assessment

Definitions

  • Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements.
  • the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
  • a computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium.
  • the medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
  • Each computer program may be tangibly stored in a machine-readable storage medium or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage medium or device is read by the computer to perform the procedures described herein.
  • the inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
  • the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution.
  • I/O devices including but not limited to keyboards, displays, pointing devices, etc. may be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
  • Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
  • the term “hardware processor subsystem” or “hardware processor” can refer to a processor, memory, software or combinations thereof that cooperate to perform one or more specific tasks.
  • the hardware processor subsystem can include one or more data processing elements (e.g., logic circuits, processing circuits, instruction execution devices, etc.).
  • the one or more data processing elements can be included in a central processing unit, a graphics processing unit, and/or a separate processor- or computing element-based controller (e.g., logic gates, etc.).
  • the hardware processor subsystem can include one or more on-board memories (e.g., caches, dedicated memory arrays, read only memory, etc.).
  • the hardware processor subsystem can include one or more memories that can be on or off board or that can be dedicated for use by the hardware processor subsystem (e.g., ROM, RAM, basic input/output system (BIOS), etc.).
  • the hardware processor subsystem can include and execute one or more software elements.
  • the one or more software elements can include an operating system and/or one or more applications and/or specific code to achieve a specified result.
  • the hardware processor subsystem can include dedicated, specialized circuitry that performs one or more electronic processing functions to achieve a specified result.
  • Such circuitry can include one or more application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and/or programmable logic arrays (PLAs).
  • the system 108 can include a hardware processor 502 , and memory 504 that is coupled to the hardware processor 502 .
  • a monitor interface 506 provides communications between the DTR system 108 and the patient monitor 106.
  • a treatment interface 508 provides communications between the DTR system 108 and the treatment application system 110.
  • the interfaces 506 and 508 can each include any appropriate wired or wireless communications protocol and medium.
  • the DTR system 108 may be integrated with one or both of the patient monitor 106 and the treatment application system 110, such that the interfaces 506 and 508 represent internal communications, such as buses.
  • one or both of the patient monitor 106 and the treatment application system 110 can be implemented as separate, discrete pieces of hardware, that communicate with the DTR system 108 .
  • the DTR system 108 may include one or more functional modules.
  • such modules can be implemented as software that is stored in memory 504 and that is executed by hardware processor 502 .
  • such modules can be implemented as one or more discrete hardware components, for example implemented as application-specific integrated chips or field programmable gate arrays.
  • patient information is received through the monitor interface 506 .
  • this information may be received as discrete sensor readings from a variety of sensors 104 .
  • this information may be received from the patient monitor 106 as a consolidated vector that represents multiple measurements.
  • Some patient information may also be stored in the memory 504 , for example in the form of patient demographic information and medical history.
  • the ACIL model 510 uses the collected patient information to generate a treatment trajectory. This trajectory is updated as new patient information is received.
  • the treatment interface 508 sends information about the treatment trajectory to the treatment application system 110 , for use with the patient.
  • the ACIL model 510 may be implemented with one or more artificial neural networks. These networks are trained, for example in the manner described above, using model trainer 512 .
  • The model trainer 512 uses a set of training data, which may be stored in memory 504, and which may include treatment trajectories that resulted in positive health outcomes, as well as treatment trajectories that resulted in negative health outcomes.
  • An artificial neural network is an information processing system that is inspired by biological nervous systems, such as the brain.
  • the key element of ANNs is the structure of the information processing system, which includes a large number of highly interconnected processing elements (called “neurons”) working in parallel to solve specific problems.
  • ANNs are furthermore trained in-use, with learning that involves adjustments to weights that exist between the neurons.
  • An ANN is configured for a specific application, such as pattern recognition or data classification, through such a learning process.
  • ANNs demonstrate an ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be detected by humans or other computer-based systems.
  • the structure of a neural network is known generally to have input neurons 602 that provide information to one or more "hidden" neurons 604. Connections 608 between the input neurons 602 and hidden neurons 604 are weighted, and these weighted inputs are then processed by the hidden neurons 604 according to some function in the hidden neurons 604, with weighted connections 608 between the layers. There may be any number of layers of hidden neurons 604, as well as neurons that perform different functions. Different neural network structures also exist, such as a convolutional neural network, a maxout network, etc. Finally, a set of output neurons 606 accepts and processes weighted input from the last set of hidden neurons 604.
  • the output is compared to a desired output available from training data.
  • the error relative to the training data is then processed in “feed-back” computation, where the hidden neurons 604 and input neurons 602 receive information regarding the error propagating backward from the output neurons 606 .
  • weight updates are performed, with the weighted connections 608 being updated to account for the received error.
  • an ANN architecture 700 is shown. It should be understood that the present architecture is purely exemplary, and that other architectures or types of neural network may be used instead.
  • the ANN embodiment described herein is included with the intent of illustrating general principles of neural network computation at a high level of generality and should not be construed as limiting in any way.
  • layers of neurons described below and the weights connecting them are described in a general manner and can be replaced by any type of neural network layers with any appropriate degree or type of interconnectivity.
  • layers can include convolutional layers, pooling layers, fully connected layers, softmax layers, or any other appropriate type of neural network layer.
  • layers can be added or removed as needed and the weights can be omitted for more complicated forms of interconnection.
  • a set of input neurons 702 each provide an input signal in parallel to a respective row of weights 704 .
  • the weights 704 each have a respective settable value, such that a weight output passes from the weight 704 to a respective hidden neuron 706 to represent the weighted input to the hidden neuron 706 .
  • the weights 704 may simply be represented as coefficient values that are multiplied against the relevant signals. The signals from each weight add column-wise and flow to a hidden neuron 706.
  • the hidden neurons 706 use the signals from the array of weights 704 to perform some calculation.
  • the hidden neurons 706 then output a signal of their own to another array of weights 704 .
  • This array performs in the same way, with a column of weights 704 receiving a signal from their respective hidden neuron 706 to produce a weighted signal output that adds row-wise and is provided to the output neuron 708 .
  • any number of these stages may be implemented, by interposing additional layers of arrays and hidden neurons 706 . It should also be noted that some neurons may be constant neurons 709 , which provide a constant output to the array. The constant neurons 709 can be present among the input neurons 702 and/or hidden neurons 706 and are only used during feed-forward operation.
  • the output neurons 708 provide a signal back across the array of weights 704 .
  • the output layer compares the generated network response to training data and computes an error.
  • the error signal can be made proportional to the error value.
  • a row of weights 704 receives a signal from a respective output neuron 708 in parallel and produces an output which adds column-wise to provide an input to hidden neurons 706 .
  • the hidden neurons 706 combine the weighted feedback signal with a derivative of their feed-forward calculation and store an error value before outputting a feedback signal to their respective columns of weights 704. This back propagation travels through the entire network 700 until all hidden neurons 706 and the input neurons 702 have stored an error value.
  • the stored error values are used to update the settable values of the weights 704 .
  • the weights 704 can be trained to adapt the neural network 700 to errors in its processing. It should be noted that the three modes of operation, feed forward, back propagation, and weight update, do not overlap with one another.
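  • As a minimal illustration of the feed-forward, back-propagation, and weight-update modes described above (the layer sizes, learning rate, and squared-error loss below are assumptions chosen for the example, not taken from the disclosure), one training step can be sketched in plain NumPy:

```python
import numpy as np

# Hypothetical sizes: 4 input neurons, 8 hidden neurons, 3 output neurons.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 8))   # weights between input and hidden layer
W2 = rng.normal(scale=0.1, size=(8, 3))   # weights between hidden and output layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = rng.normal(size=(1, 4))               # one input example
target = np.array([[0.0, 1.0, 0.0]])      # desired output from training data

# Feed-forward: weighted inputs flow from input to hidden to output neurons.
h = sigmoid(x @ W1)
y = sigmoid(h @ W2)

# Back-propagation: the error propagates backward from the output neurons.
err_out = (y - target) * y * (1 - y)       # output-layer error (squared-error loss)
err_hid = (err_out @ W2.T) * h * (1 - h)   # hidden-layer error

# Weight update: adjust the settable weight values to reduce the error.
lr = 0.5
W2 -= lr * h.T @ err_out
W1 -= lr * x.T @ err_hid
```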
  • any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B).
  • such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C).
  • This may be extended for as many items listed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

Methods and systems for responding to changing conditions include training a model, using a processor, using trajectories that resulted in a positive outcome and trajectories that resulted in a negative outcome. Training is performed using an adversarial discriminator to train the model to generate trajectories that are similar to historical trajectories that resulted in a positive outcome, and using a cooperative discriminator to train the model to generate trajectories that are dissimilar to historical trajectories that resulted in a negative outcome. A dynamic response regime is generated using the trained model and environment information. A response to changing environment conditions is performed in accordance with the dynamic response regime.

Description

    RELATED APPLICATION INFORMATION
  • This application is a continuing application of U.S. patent application Ser. No. 16/998,228, filed 20 Aug. 2020, which claims the benefit of U.S. Provisional Patent Application Ser. No. 62/893,324, filed on 29 Aug. 2019, both of which are incorporated herein by reference in their entireties.
  • BACKGROUND Technical Field
  • The present invention relates to providing medical treatments to patients, and, more particularly, to determining tailored treatments that are adjusted over time according to the changing state of the patients.
  • Description of the Related Art
  • Determining treatments for individual patients has historically been performed by highly skilled doctors, who apply their experience and training to assess the patient's needs and provide a course of treatment. However, the fallibility of human judgment leads to errors. As a result, there is a need to automate the process of medical decision-making, particularly as it applies to the modification of a treatment plan in response to changing patient conditions.
  • SUMMARY
  • A method for responding to changing conditions includes training a model, using a processor, using trajectories that resulted in a positive outcome and trajectories that resulted in a negative outcome. Training is performed using an adversarial discriminator to train the model to generate trajectories that are similar to historical trajectories that resulted in a positive outcome, and using a cooperative discriminator to train the model to generate trajectories that are dissimilar to historical trajectories that resulted in a negative outcome. A dynamic response regime is generated using the trained model and environment information. A response to changing environment conditions is performed in accordance with the dynamic response regime.
  • A method for treating a patient includes training a model on historical treatment trajectories, including trajectories that resulted in a positive health outcome and trajectories that resulted in a negative health outcome. A dynamic treatment regime is generated for a patient using the trained model and patient information. The patient is treated in accordance with the dynamic treatment regime, in a manner that is responsive to changing patient conditions, by triggering one or more medical devices to administer a treatment to the patient.
  • A system for treating a patient includes a machine learning model, configured to generate a dynamic response regime using environment information. A model trainer is configured to train the machine learning model, including trajectories that resulted in a positive outcome and trajectories that resulted in a negative outcome, by using an adversarial discriminator to train the machine learning model to generate trajectories that are similar to historical trajectories that resulted in a positive outcome, and by using a cooperative discriminator to train the model to generate trajectories that are dissimilar to historical trajectories that resulted in a negative outcome. A response interface is configured to trigger a response to changing environment conditions in accordance with the dynamic response regime.
  • These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
  • BRIEF DESCRIPTION OF DRAWINGS
  • The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
  • FIG. 1 is a block diagram showing a patient being monitored and treated by a system that uses a dynamic treatment regime to react to changing patient conditions, in accordance with an embodiment of the present invention;
  • FIG. 2 is a block/flow diagram of a method for generating and implementing a dynamic treatment regime for a patient, in accordance with an embodiment of the present invention;
  • FIG. 3 is a block/flow diagram of a method for training a machine learning model to generate dynamic treatment regimes, in accordance with an embodiment of the present invention;
  • FIG. 4 is pseudo-code for a learning process for a machine learning model to generate dynamic treatment regimes, in accordance with an embodiment of the present invention;
  • FIG. 5 is a block diagram of a dynamic treatment regime system that generates and implements a dynamic treatment regime, in accordance with an embodiment of the present invention;
  • FIG. 6 is a diagram of an exemplary neural network structure, in accordance with an embodiment of the present invention; and
  • FIG. 7 is a diagram of an exemplary neural network structure with weights, in accordance with an embodiment of the present invention.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • Embodiments of the present invention provide a dynamic treatment regime (DTR), a sequence of tailored treatment decisions that specify how treatments should be adjusted through time, in accordance with the dynamic states of patients. Rules in the DTR can take input information, such as a patient's medical history, laboratory results, and demographic information, and output recommended treatments to improve the effectiveness of the treatment program.
  • The present embodiments can make use of deep reinforcement techniques for machine learning, for example to learn treatment policies from doctors' previous treatment plans. The present embodiments do so in such a way as to avoid the compounding errors that can result from supervised methods that are based on behavior cloning and the sparsity of self-defined reward signals in reinforcement learning models. Treatment paths are considered that include both positive trajectories, where a positive health outcome was achieved for a patient, and negative trajectories, where a negative health outcome resulted. By using both positive and negative trajectories, productive strategies are learned, and unproductive strategies are avoided.
  • Toward that end, the present embodiments use an adversarial cooperative imitation learning (ACIL) model to determine the dynamic treatment regimes that produce positive outcomes, while staying away from negative trajectories. Two discriminators can be used, including an adversarial discriminator and a cooperative discriminator. The adversarial discriminator minimizes the discrepancies between the output trajectories and the positive trajectories in a set of training data, while the cooperative discriminator distinguishes the negative trajectories from the positive trajectories and the output trajectories. Reward signals from the discriminators are used to refine the policy that generates dynamic treatment regimes.
  • Based on the policies learned by the model, DTRs are generated in response to specific patient information. These DTRs are then implemented, by providing the specified care and treatment to the patients, responsive to the changing condition for each patient. The present embodiments thereby reduce the likelihood of a negative health outcome and provide superior dynamic treatment regimens.
  • Referring now to FIG. 1, an embodiment of the present invention is shown. A patient 102 is shown. The patient 102 may, for example, have a medical condition that is being treated. One or more sensors 104 monitor information about the patient's condition, and provide the information to patient monitor 106. This information may include vital information, such as heart rate, blood oxygen saturation, blood pressure, body temperature, and blood sugar levels. The information may also include patient activity information, such as movements and location. In each case, the information may be collected by any appropriate sensing device or device(s) 104. The patient monitor 106 may also accept information about the patient that is not sensed directly, for example including the patient's demographic information (e.g., age, medical history, family medical history, etc.) and the patient's own statement of symptoms, for example input by the patient or collected by a medical professional.
  • The patient monitor 106 renders the collected information in a format suitable for the DTR system 108. The DTR system 108 includes a set of rules for how treatment should progress, based on updates to the patient's monitored information. As just one example of such a rule, if a patient's blood pressure were to drop below a threshold, the DTR system 108 may indicate an appropriate medical response and adjustment to treatment. The DTR system's policies are learned in advance, as described in greater detail below, to incorporate past instances of successful and unsuccessful treatments, thereby providing a set of rules that stay close to successful treatment trajectories, while staying away from unsuccessful treatment trajectories.
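  • As a purely hypothetical illustration of such a rule (the field name, threshold value, and action label below are invented for the example and are not taken from the disclosure), a single DTR rule can be sketched as a function of the monitored patient state:

```python
from typing import Optional

def blood_pressure_rule(patient_state: dict) -> Optional[str]:
    """Hypothetical DTR rule: recommend an intervention when systolic blood
    pressure drops below an illustrative threshold of 90 mmHg."""
    if patient_state.get("systolic_bp", 120.0) < 90.0:
        return "administer_vasopressor"  # illustrative action label only
    return None  # no adjustment to the current treatment
```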
  • A treatment application system 110 accepts directives from the DTR system 108 and takes an appropriate action. In some cases, when the treatment recommendation involves the intervention of a medical professional, the treatment system 110 can output an alert or an instruction for the recommended treatment. In other cases, the treatment recommendation can include an automatic treatment intervention, by way of one or more medical treatment devices 112. As just one example of such an automatic treatment, if the DTR system 108 indicates that a patient's dropping blood pressure necessitates a quick pharmaceutical intervention, the treatment system 110 may cause a treatment device to introduce an appropriate medication to the patient's bloodstream.
  • In this manner, the present embodiments can make rapid adjustments to a patient's treatment, responsive to the patient's changing medical condition. This reduces the reliance on fallible human decision-making and can lead to superior outcomes, particularly in stressful situations, where a decision needs to be made quickly and correctly.
  • Referring now to FIG. 2 , a method of treating a patient is shown. Block 202 builds a set of training data that includes, for example, records of historical treatment trajectories. The historical treatment trajectories may include information about patient condition, information about the timing and type of treatment actions and changes, and information about the treatment's outcome. Treatment trajectories with both positive health outcomes and negative health outcomes are included in the training set.
  • In some embodiments, the trajectories can be represented as sequences of states and actions $(s_0, a_0, s_1, a_1, \ldots)$ drawn from a policy $\pi$. Thus, each state $s_t \in \mathcal{S}$ includes collected patient information at a time $t$, and each action $a_t \in \mathcal{A}$ includes a $K$-dimensional binary-valued vector, where the value on each dimension represents the application of a particular medication, dosage, or treatment action. Some of the trajectories are associated with policies that result in positive outcomes ($\pi^+$), while other trajectories are associated with policies that result in negative outcomes ($\pi^-$). The positive trajectories can be expressed as $\tau^+ = (s_1^+, a_1^+, \ldots)$ and the negative trajectories can be expressed as $\tau^- = (s_1^-, a_1^-, \ldots)$.
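  • A minimal sketch of this trajectory representation is shown below; the array shapes, field names, and the use of NumPy are assumptions made for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray   # shape (T, state_dim): collected patient information per time step
    actions: np.ndarray  # shape (T, K): K-dimensional binary-valued treatment vectors
    positive: bool       # True if the trajectory ended in a positive health outcome

# Illustrative trajectory: 3 time steps, 5 state features, K = 4 treatment actions.
example = Trajectory(
    states=np.random.rand(3, 5),
    actions=np.random.randint(0, 2, size=(3, 4)),
    positive=True,
)

trajectories = [example]
tau_pos = [t for t in trajectories if t.positive]      # trajectories from pi+
tau_neg = [t for t in trajectories if not t.positive]  # trajectories from pi-
```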
  • Block 204 then uses the training set to train the ACIL model. This model may be implemented using machine learning techniques, described in greater detail below. The model accepts patient information as an input, and outputs one or more DTR policies for the patient. As noted above, a DTR policy includes one or more rules that are used to adapt treatment to changing patient conditions.
  • Block 206 then collects information for a specific patient 102, as described above. In block 208, the patient information is used as an input to the ACIL model to produce a DTR policy for the specific patient 102, relating to that patient's treatment needs. The output policy can be expressed as πθ, with a parameter vector θ that represents the particular policy rules. Block 210 then applies a recommended treatment to the patient 102, using the collected patient information, following a trajectory τθ that is generated by the policy πθ. As time goes on, block 212 updates the patient information, for example with current measurements. Block 210 then uses this updated information to determine any updated treatments that may be needed, according to the DTR. This process can continue indefinitely, or can be interrupted by a positive or negative health outcome.
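  • The loop of blocks 206 through 212 can be sketched as follows; the callback names and the stopping condition are assumptions for illustration, not part of the disclosed method.

```python
from typing import Callable, Optional
import numpy as np

def run_dtr_loop(policy: Callable[[np.ndarray], np.ndarray],
                 read_patient_state: Callable[[], np.ndarray],
                 apply_treatment: Callable[[np.ndarray], None],
                 check_outcome: Callable[[np.ndarray], Optional[str]],
                 max_steps: int = 1000) -> Optional[str]:
    """Hypothetical control loop: observe the patient, apply the policy's
    recommended treatment, and repeat until an outcome is reached."""
    for _ in range(max_steps):
        state = read_patient_state()      # blocks 206/212: collect or update patient information
        action = policy(state)            # blocks 208/210: pi_theta maps state to treatment actions
        apply_treatment(action)           # block 210: trigger alerts or treatment devices 112
        outcome = check_outcome(state)    # stop on a positive or negative health outcome
        if outcome is not None:
            return outcome
    return None
```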
  • Referring now to FIG. 3 , additional information on the training of the ACIL model in block 204 is shown. As an overview, block 302 trains the patient model, which serves as an environment simulator. The adversarial discriminator, cooperative discriminator, and policy network are then iteratively trained until they converge in blocks 304, 306, and 308. Convergence can be determined, for example, by determining that the improvement from one iteration to the next has fallen below a predetermined threshold. Alternatively, processing can stop when a predetermined number of iterations has been reached.
  • In block 302, the environment can be simulated with generative models, such as variational auto-encoders, for model-based reinforcement learning and trajectory embedding. As an alternative to a variational auto-encoder, a generative adversarial network can be used instead. The variational auto-encoder architecture builds a patient model that transforms a state distribution into an underlying latent space. The patient model includes an encoder, which maps the current state and action to a latent distribution $z \sim \mathcal{N}(\mu, \sigma)$, and a decoder, which maps the latent $z$ and the current state $s_t$ and action $a_t$ into a successor state $\hat{s}_{t+1}$. The patient model is trained to minimize a reconstruction error between the input state $s_{t+1}$ and a reconstructed state $\hat{s}_{t+1}$ that is generated by the decoder, under the latent distribution $z$. An objective function for this can be expressed as:
  • $$\min_{w}\; \mathbb{E}_{s_t, a_t, s_{t+1}}\!\left[\, \lVert s_{t+1} - \hat{s}_{t+1} \rVert^2 \,\right] + \alpha\, D_{KL}\!\big(\mathcal{N}(\mu, \sigma) \,\Vert\, \mathcal{N}(0, 1)\big)$$
  • where $w = (w_1, w_2)$ denotes the parameters of the patient model, $s_t$ is a state at time $t$, $a_t$ is an action at time $t$, $(\mu, \sigma) = E_{w_1}(s_t, a_t)$ is an encoder network that takes the current state $s_t$ and action $a_t$ as inputs, using a first parameter $w_1$, and $\hat{s}_{t+1} = D_{w_2}(s_t, a_t, z)$ is the output of a decoder network $D_{w_2}$ with a latent factor $z$ and the current state and action as input, using a second parameter $w_2$. The variable $\alpha$ represents a balancing weight between the two kinds of loss, and the function $D_{KL}$ is the Kullback-Leibler divergence.
  • In general, the auto-encoder seeks to "encode" the input information, in this case the "actions" and "states," and translates them to the latent space. In some embodiments, this latent space may represent the actions and states as vectors, which can be readily compared to one another. The decoder then translates those vectors back to "actions" and "states," and a reconstruction error represents the difference between the output of the decoder and the input to the encoder. The parameters of the auto-encoder are then modified to reduce the value of the error. Training continues, with the parameters being modified at each iteration, until the error value reaches a point where no further training is needed. This may be triggered, for example, when the error value falls below a threshold, or when the error value does not change significantly over a number of iterations.
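  • A minimal PyTorch sketch of this patient model and its objective is given below; the network sizes, the log-variance parameterization, and the value of α are assumptions made for illustration rather than details taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatientModel(nn.Module):
    """Sketch of the variational auto-encoder patient model (dimensions are assumptions).
    Encoder E_w1: (s_t, a_t) -> (mu, sigma); decoder D_w2: (s_t, a_t, z) -> estimate of s_{t+1}."""
    def __init__(self, state_dim: int, action_dim: int, latent_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)
        self.logvar_head = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s_t, a_t):
        h = self.encoder(torch.cat([s_t, a_t], dim=-1))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterized sample of z
        s_next_hat = self.decoder(torch.cat([s_t, a_t, z], dim=-1))
        return s_next_hat, mu, logvar

def patient_model_loss(model, s_t, a_t, s_next, alpha: float = 0.1):
    """Reconstruction error plus alpha-weighted KL(N(mu, sigma) || N(0, 1))."""
    s_next_hat, mu, logvar = model(s_t, a_t)
    recon = F.mse_loss(s_next_hat, s_next)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + alpha * kl
```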
  • In block 304, training the adversarial discriminator includes a comparison between the trajectories of positive outcome scenarios and the trajectories generated by a policy network. In general, the differences between two policies (e.g., the policy $\pi_\theta$ generated by the ACIL model, and a policy with a positive outcome $\pi^+$) can be measured by comparing the trajectories they generate. For a policy $\pi \in \Pi$, the occupancy measure $\rho_\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ can be defined as $\rho_\pi(s, a) = \pi(a \mid s) \sum_{t=0}^{T} \gamma^t P(s_t = s \mid \pi)$, where $\gamma$ is a discounting factor, $T$ is a maximum time value, and where successor states are drawn from $P(s \mid \pi)$. The occupancy measure can be interpreted as the distribution of state-action pairs that the policy interacts with in the environment. A policy $\pi_\theta$ can be implemented as a multiple-layer perceptron network, where $\pi_\theta$ takes the state of the patient as an input and returns, for example, recommended medications.
  • The adversarial discriminator $D_a(s, a)$ can also be implemented as a multiple-layer perceptron network, having a number and dimension of layers that are fine-tuned parameters, which estimates the probability that a state-action pair $(s, a)$ comes from a positive trajectory policy $\pi^+$, rather than a generated policy $\pi_\theta$. The learning of the adversarial discriminator can be expressed as the following objective function:
  • $$\max_{D_a}\; \mathbb{E}_{\rho_{\pi_\theta}}\big[\log(1 - D_a(s, a))\big] + \mathbb{E}_{\rho_{\pi^+}}\big[\log(D_a(s, a))\big]$$
  • This objective function is equivalent to minimizing the Jensen-Shannon divergence $D_{JS}$ between the distributions of state-action pairs $\rho_{\pi_\theta}$ and $\rho_{\pi^+}$, which are generated by interacting with the environment using policy $\pi_\theta$ and policy $\pi^+$. Here $\mathbb{E}_{\rho_{\pi_\theta}}$ represents the expectation over all $(s, a)$ pairs sampled from $\rho_{\pi_\theta}$. $D_a$ is referred to as an adversarial discriminator, because the goals of optimizing $D_a$ and $\pi_\theta$ are opposite: $D_a$ seeks to minimize the probability of the state-action pairs generated by $\pi_\theta$, while $\pi_\theta$ is selected to maximize the probability of $D_a$ making a mistake.
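  • A minimal PyTorch sketch of $D_a$ and this objective follows; the layer sizes are assumptions, and maximizing the objective is implemented, as is conventional, by minimizing the corresponding binary cross-entropy.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Multi-layer perceptron over state-action pairs; layer sizes are illustrative."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def adversarial_discriminator_loss(d_a, s_pos, a_pos, s_gen, a_gen):
    """Maximizing the D_a objective is equivalent to minimizing this binary
    cross-entropy: positive-outcome pairs are labeled 1, generated pairs 0."""
    eps = 1e-8
    loss_pos = -torch.log(d_a(s_pos, a_pos) + eps).mean()        # E_{rho_pi+}[log D_a]
    loss_gen = -torch.log(1.0 - d_a(s_gen, a_gen) + eps).mean()  # E_{rho_pi_theta}[log(1 - D_a)]
    return loss_pos + loss_gen
```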
  • In block 306, training the cooperative discriminator includes training a model to differentiate the generated trajectories and the positive trajectory policies from the negative trajectory policies. The occupancy measure ρπ can be used again to compare the different policies. The objective function for learning the cooperative discriminator Dc can be expressed as:
  • $$\max_{D_c} \; \mathbb{E}_{\rho_{\pi_\theta},\, \rho_{\pi_+}}\big[\log(D_c(s, a))\big] + \mathbb{E}_{\rho_{\pi_-}}\big[\log(1 - D_c(s, a))\big]$$
  • This objective function characterizes the optimal negative log loss of classifying the positive trajectories generated from πθ and π+ against the negative trajectories generated from π−. This is referred to as a cooperative discriminator because Dc and πθ share the same goal: to maximize the probability that the data generated by πθ is classified as positive. The losses from Da and Dc can be considered as reward functions that help refine πθ. When the distribution ρπθ is different from ρπ−, it receives a large reward from Dc. With an optimal Dc, the loss of πθ is $D_{JS}\big((\rho_{\pi_+} + \rho_{\pi_\theta}) \,\Vert\, \rho_{\pi_-}\big)$.
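  • A corresponding sketch of the cooperative objective is given below, reusing the Discriminator class from the previous sketch; pooling the generated and positive pairs into a single class is the only difference from the adversarial case.

```python
import torch

def cooperative_loss(d_c, s_gen, a_gen, s_pos, a_pos, s_neg, a_neg, eps=1e-8):
    """Negated D_c objective: generated and positive pairs high, negative pairs low."""
    scores_pos_like = torch.cat([d_c(s_gen, a_gen), d_c(s_pos, a_pos)])
    term_pos_like = torch.log(scores_pos_like + eps).mean()
    term_neg = torch.log(1.0 - d_c(s_neg, a_neg) + eps).mean()
    return -(term_pos_like + term_neg)
```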
  • In block 308, training the policy network seeks to update the policy network πθ to mimic positive trajectories, while staying away from negative trajectories. The network incorporates the reward signals from both Da and Dc. The signal from Da is used to push πθ closer to π+, while the signal from Dc separates πθ and π−. The loss function can be defined as:
  • $$\min_{\pi_\theta} \; \omega_\alpha\, \mathbb{E}_{\rho_{\pi_\theta}}\big[\log(1 - D_a(s, a))\big] \;-\; \omega_\beta\, \mathbb{E}_{\rho_{\pi_\theta}}\big[\log(D_c(s, a))\big] \;-\; \lambda H(\pi_\theta)$$
  • where H(πθ) is the causal entropy of the policy, which encourages diversity in the learned policy, and λ≥0 is a parameter that is used to control the weight of H(πθ). The parameters ωα and ωβ are weights with values between 0 and 1, and balance the reward signals.
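  • In GAIL-style imitation learning, losses of this form are commonly optimized by treating the discriminator outputs as per-step rewards and updating πθ with a policy-gradient method, with the causal-entropy term added as a bonus. The following is a hedged sketch of such a combined reward signal; the weights and the choice of a policy-gradient update are assumptions for illustration, not requirements of the embodiments.

```python
import torch

def acil_reward(d_a, d_c, s, a, w_alpha=0.5, w_beta=0.5, eps=1e-8):
    """Per-step reward for pi_theta combining both discriminator signals."""
    r_adv = -torch.log(1.0 - d_a(s, a) + eps)   # large when D_a is fooled (close to pi_+)
    r_coop = torch.log(d_c(s, a) + eps)         # large when D_c sees the pair as non-negative
    return w_alpha * r_adv + w_beta * r_coop
```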
  • The adversarial discriminator Da, the cooperative discriminator Dc, and the policy network πθ are trained in a three-party min-max game, which can be defined as:
  • $$\min_{\pi_\theta,\, D_c} \max_{D_a} \;\; \omega_\alpha \Big( \mathbb{E}_{\rho_{\pi_\theta}}\big[\log(1 - D_a(s, a))\big] + \mathbb{E}_{\rho_{\pi_+}}\big[\log(D_a(s, a))\big] \Big) \;-\; \omega_\beta \Big( \mathbb{E}_{\rho_{\pi_\theta},\, \rho_{\pi_+}}\big[\log(D_c(s, a))\big] + \mathbb{E}_{\rho_{\pi_-}}\big[\log(1 - D_c(s, a))\big] \Big) \;-\; \lambda H(\pi_\theta)$$
  • where ωα and ωβ are weight parameters that weight the contributions of the adversarial discriminator and the cooperative discriminator. The entropy of the policy πθ encourages policy diversity, and is defined as:

  • $$H(\pi_\theta) \triangleq \mathbb{E}_{\pi_\theta}\big[-\log \pi_\theta(a \mid s)\big]$$
  • When both Da and Dc are optimized, the outcome of the three-party min-max game is equivalent to the following optimization problem:
  • $$\min_{\pi_\theta} \; D_{JS}\big(\rho_{\pi_+} \,\Vert\, \rho_{\pi_\theta}\big) \;-\; D_{JS}\big((\rho_{\pi_+} + \rho_{\pi_\theta}) \,\Vert\, \rho_{\pi_-}\big) \;-\; \lambda H(\pi_\theta)$$
  • which finds a policy whose occupancy measure minimizes the JS divergence to ρπ+ and maximizes the JS divergence to ρπ−.
  • Referring now to FIG. 4 , pseudo-code of the learning process for an ACIL model is shown. First the patient model Gw is trained, followed by iterative training of Da, Dc, and πθ.
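  • FIG. 4 is not reproduced here; the following is a plausible outline of that procedure, assembled from the building blocks sketched above. The helpers rollout( ), pairs( ), and update_policy( ) are hypothetical: they are assumed to collect trajectories from the learned patient model, flatten them into (s, a) batches, and apply the policy update of block 308.

```python
def train_acil(pos_trajs, neg_trajs, patient_model, pi_theta, d_a, d_c,
               opt_da, opt_dc, opt_pi, n_iters=1000):
    """Plausible outline of the ACIL learning process (FIG. 4 not reproduced)."""
    # The patient model G_w is assumed to be pre-trained on (s_t, a_t, s_{t+1}) triples.
    for _ in range(n_iters):
        # Roll out pi_theta in the learned patient model to obtain generated trajectories.
        gen_trajs = rollout(pi_theta, patient_model)                     # hypothetical helper
        # Train the adversarial discriminator: generated vs. positive trajectories.
        opt_da.zero_grad()
        adversarial_loss(d_a, *pairs(gen_trajs), *pairs(pos_trajs)).backward()
        opt_da.step()
        # Train the cooperative discriminator: generated + positive vs. negative trajectories.
        opt_dc.zero_grad()
        cooperative_loss(d_c, *pairs(gen_trajs), *pairs(pos_trajs),
                         *pairs(neg_trajs)).backward()
        opt_dc.step()
        # Update pi_theta using the combined reward signal (policy-gradient step).
        update_policy(pi_theta, opt_pi, gen_trajs, d_a, d_c)             # hypothetical helper
```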
  • In tests, the present embodiments generated policies that substantially outperformed baseline processes for generating treatment trajectories. ACIL treats the discovery of DTRs as a sequential decision-making problem and focuses on the long-term influence of the current action. Additionally, with the use of both positive and negative trajectory examples as training data, ACIL is able to mimic policies that have positive health outcomes while avoiding mistakes. The result is a superior treatment policy that responds to changing patient conditions in a manner that maximizes the likelihood of a positive health outcome.
  • Embodiments described herein may be entirely hardware, entirely software or including both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
  • Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
  • Each computer program may be tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
  • A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
  • Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
  • As employed herein, the term “hardware processor subsystem” or “hardware processor” can refer to a processor, memory, software or combinations thereof that cooperate to perform one or more specific tasks. In useful embodiments, the hardware processor subsystem can include one or more data processing elements (e.g., logic circuits, processing circuits, instruction execution devices, etc.). The one or more data processing elements can be included in a central processing unit, a graphics processing unit, and/or a separate processor- or computing element-based controller (e.g., logic gates, etc.). The hardware processor subsystem can include one or more on-board memories (e.g., caches, dedicated memory arrays, read only memory, etc.). In some embodiments, the hardware processor subsystem can include one or more memories that can be on or off board or that can be dedicated for use by the hardware processor subsystem (e.g., ROM, RAM, basic input/output system (BIOS), etc.).
  • In some embodiments, the hardware processor subsystem can include and execute one or more software elements. The one or more software elements can include an operating system and/or one or more applications and/or specific code to achieve a specified result.
  • In other embodiments, the hardware processor subsystem can include dedicated, specialized circuitry that performs one or more electronic processing functions to achieve a specified result. Such circuitry can include one or more application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and/or programmable logic arrays (PLAs).
  • These and other variations of a hardware processor subsystem are also contemplated in accordance with embodiments of the present invention.
  • Referring now to FIG. 5 , additional detail on the DTR system 108 is shown. The system 108 can include a hardware processor 502, and memory 504 that is coupled to the hardware processor 502. A monitor interface 506 provides communications between the DTR system 108 and the patient monitor 106, while a treatment interface 508 provides communications between the DTR system 108 and the treatment application system 110.
  • It should be understood that the interfaces 506 and 508 can each include any appropriate wired or wireless communications protocol and medium. In some embodiments, the DTR system 108 may be integrated with one or both of the patient monitor 106 and the treatment application system 110, such that the interfaces 506 and 508 represent internal communications, such as buses. In some embodiments, one or both of the patient monitor 106 and the treatment application system 110 can be implemented as separate, discrete pieces of hardware that communicate with the DTR system 108.
  • The DTR system 108 may include one or more functional modules. In some embodiments, such modules can be implemented as software that is stored in memory 504 and that is executed by hardware processor 502. In other embodiments, such modules can be implemented as one or more discrete hardware components, for example implemented as application-specific integrated chips or field programmable gate arrays.
  • During operation, patient information is received through the monitor interface 506. In some embodiments, this information may be received as discrete sensor readings from a variety of sensors 104. In other embodiments, this information may be received from the patient monitor 106 as a consolidated vector that represents multiple measurements. Some patient information may also be stored in the memory 504, for example in the form of patient demographic information and medical history.
  • The ACIL model 510 uses the collected patient information to generate a treatment trajectory. This trajectory is updated as new patient information is received. The treatment interface 508 sends information about the treatment trajectory to the treatment application system 110, for use with the patient.
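  • A hypothetical sketch of this runtime loop follows; the interface method names are assumptions made for illustration and do not correspond to any particular API.

```python
def run_dtr_loop(monitor_interface, treatment_interface, acil_model):
    """Hypothetical operation of DTR system 108: read state, recommend, apply."""
    while True:
        state = monitor_interface.read_patient_vector()   # consolidated sensor readings
        action = acil_model.recommend(state)              # e.g., medication and dosage
        treatment_interface.send(action)                  # forwarded to treatment system 110
```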
  • In some embodiments, the ACIL model 510 may be implemented with one or more artificial neural networks. These networks are trained, for example in the manner described above, using a model trainer 512. The model trainer 512 uses a set of training data, which may be stored in memory 504, and which may include treatment trajectories that resulted in positive health outcomes, as well as treatment trajectories that resulted in negative health outcomes.
  • An artificial neural network (ANN) is an information processing system that is inspired by biological nervous systems, such as the brain. The key element of ANNs is the structure of the information processing system, which includes a large number of highly interconnected processing elements (called “neurons”) working in parallel to solve specific problems. ANNs are furthermore trained in-use, with learning that involves adjustments to weights that exist between the neurons. An ANN is configured for a specific application, such as pattern recognition or data classification, through such a learning process.
  • Referring now to FIG. 6 , a generalized diagram of a neural network is shown. ANNs demonstrate an ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be detected by humans or other computer-based systems. The structure of a neural network is known generally to have input neurons 602 that provide information to one or more “hidden” neurons 604. Connections 608 between the input neurons 602 and hidden neurons 604 are weighted, and these weighted inputs are then processed by the hidden neurons 604 according to some function in the hidden neurons 604, with weighted connections 608 between the layers. There may be any number of layers of hidden neurons 604, as well as neurons that perform different functions. There also exist different neural network structures, such as convolutional neural networks, maxout networks, etc. Finally, a set of output neurons 606 accepts and processes weighted input from the last set of hidden neurons 604.
  • This represents a “feed-forward” computation, where information propagates from input neurons 602 to the output neurons 606. Upon completion of a feed-forward computation, the output is compared to a desired output available from training data. The error relative to the training data is then processed in “feed-back” computation, where the hidden neurons 604 and input neurons 602 receive information regarding the error propagating backward from the output neurons 606. Once the backward error propagation has been completed, weight updates are performed, with the weighted connections 608 being updated to account for the received error. This represents just one variety of ANN.
  • Referring now to FIG. 7 , an ANN architecture 700 is shown. It should be understood that the present architecture is purely exemplary, and that other architectures or types of neural network may be used instead. The ANN embodiment described herein is included with the intent of illustrating general principles of neural network computation at a high level of generality and should not be construed as limiting in any way.
  • Furthermore, the layers of neurons described below and the weights connecting them are described in a general manner and can be replaced by any type of neural network layers with any appropriate degree or type of interconnectivity. For example, layers can include convolutional layers, pooling layers, fully connected layers, softmax layers, or any other appropriate type of neural network layer. Furthermore, layers can be added or removed as needed and the weights can be omitted for more complicated forms of interconnection.
  • During feed-forward operation, a set of input neurons 702 each provide an input signal in parallel to a respective row of weights 704. The weights 704 each have a respective settable value, such that a weight output passes from the weight 704 to a respective hidden neuron 706 to represent the weighted input to the hidden neuron 706. In software embodiments, the weights 704 may simply be represented as coefficient values that are multiplied against the relevant signals. The signals from each weight add column-wise and flow to a hidden neuron 706.
  • The hidden neurons 706 use the signals from the array of weights 704 to perform some calculation. The hidden neurons 706 then output a signal of their own to another array of weights 704. This array performs in the same way, with a column of weights 704 receiving a signal from their respective hidden neuron 706 to produce a weighted signal output that adds row-wise and is provided to the output neuron 708.
  • It should be understood that any number of these stages may be implemented, by interposing additional layers of arrays and hidden neurons 706. It should also be noted that some neurons may be constant neurons 709, which provide a constant output to the array. The constant neurons 709 can be present among the input neurons 702 and/or hidden neurons 706 and are only used during feed-forward operation.
  • During back propagation, the output neurons 708 provide a signal back across the array of weights 704. The output layer compares the generated network response to training data and computes an error. The error signal can be made proportional to the error value. In this example, a row of weights 704 receives a signal from a respective output neuron 708 in parallel and produces an output which adds column-wise to provide an input to hidden neurons 706. The hidden neurons 706 combine the weighted feedback signal with a derivative of their feed-forward calculation and store an error value before outputting a feedback signal to their respective columns of weights 704. This back propagation travels through the entire network 700 until all hidden neurons 706 and the input neurons 702 have stored an error value.
  • During weight updates, the stored error values are used to update the settable values of the weights 704. In this manner the weights 704 can be trained to adapt the neural network 700 to errors in its processing. It should be noted that the three modes of operation, feed forward, back propagation, and weight update, do not overlap with one another.
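  • As a concrete illustration only, the following toy example walks a single input through the three modes of operation described above: feed-forward, back propagation, and weight update. The layer sizes, activation function, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))          # input neurons
y_true = np.array([[1.0]])           # desired output from training data
W1 = rng.normal(size=(4, 8)) * 0.1   # weights between input and hidden neurons
W2 = rng.normal(size=(8, 1)) * 0.1   # weights between hidden and output neurons

# Feed-forward: weighted inputs propagate from input neurons to the output neuron.
h = np.tanh(x @ W1)
y = h @ W2

# Back propagation: the error propagates backward through the same weights.
err = y - y_true
grad_W2 = h.T @ err
grad_h = (err @ W2.T) * (1.0 - h ** 2)   # derivative of the hidden activation
grad_W1 = x.T @ grad_h

# Weight update: settable weight values are adjusted to reduce the error.
lr = 0.1
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```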
  • Reference in the specification to “one embodiment” or “an embodiment” of the present invention, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment”, as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment. However, it is to be appreciated that features of one or more embodiments can be combined given the teachings of the present invention provided herein.
  • It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended for as many items listed.
  • The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Claims (17)

What is claimed is:
1. A method for responding to changing conditions, comprising:
training a model, using a processor, including trajectories that resulted in a positive outcome and trajectories that resulted in a negative outcome, by using an adversarial discriminator to train the model to generate trajectories that are similar to historical trajectories that resulted in a positive outcome, and by using a cooperative discriminator to train the model to generate trajectories that are dissimilar to historical trajectories that resulted in a negative outcome, and including iteratively training the adversarial discriminator, the cooperative discriminator, and the dynamic response regime using a three-party optimization until improvement from one iteration to the next has fallen below a predetermined threshold;
generating a dynamic response regime using the trained model and environment information; and
responding to changing environment conditions in accordance with the dynamic response regime.
2. The method of claim 1, wherein the historical trajectories include patient treatment trajectories.
3. The method of claim 2, wherein the positive outcomes are positive patient health outcomes, and the negative outcomes are negative patient health outcomes.
4. The method of claim 2, wherein the environment information and the environment conditions reflect information about a patient being treated.
5. The method of claim 1, wherein the adversarial discriminator, the cooperative discriminator, and the dynamic response regime are implemented as multiple-layer perceptrons.
6. The method of claim 1, wherein training the model comprises training an environment model that encodes environment information as a vector in a latent space.
7. The method of claim 1, wherein the model is implemented as a variational auto-encoder network.
8. The method of claim 1, wherein responding to changing environment conditions comprises automatically performing a responsive action to correct a negative condition.
9. A system for responding to changing conditions, comprising:
a machine learning model, configured to generate a dynamic response regime for using environment information;
a model trainer, configured to train the machine learning model, including trajectories that resulted in a positive outcome and trajectories that resulted in a negative outcome, by using an adversarial discriminator to train the machine learning model to generate trajectories that are similar to historical trajectories that resulted in a positive outcome, and by using a cooperative discriminator to train the model to generate trajectories that are dissimilar to historical trajectories that resulted in a negative outcome, and to iteratively train the adversarial discriminator, the cooperative discriminator, and the dynamic response regime using a three-party optimization until improvement from one iteration to the next has fallen below a predetermined threshold; and
a response interface, configured to trigger a response to changing environment conditions in accordance with the dynamic response regime.
10. The system of claim 9, wherein the historical trajectories that resulted in a positive outcome and the historical trajectories that resulted in a negative outcome include patient treatment trajectories.
11. The system of claim 10, wherein the positive outcomes are positive patient health outcomes, and the negative outcomes are negative patient health outcomes.
12. The system of claim 9, wherein the environment information and the environment conditions reflect information about a patient being treated.
13. The system of claim 9, wherein the model trainer is further configured to iteratively train the adversarial discriminator, the cooperative discriminator, and the dynamic response regime using a three-party optimization.
14. The system of claim 9, wherein the adversarial discriminator, the cooperative discriminator, and the dynamic response regime are implemented as multiple-layer perceptrons in the machine learning model.
15. The system of claim 9, wherein the model trainer is further configured to train an environment model that encodes the environment information as a vector in a latent space.
16. The system of claim 15, wherein the environment model is implemented as a variational auto-encoder network in the machine learning model.
17. The system of claim 9, wherein the response interface is further configured to automatically perform a responsive action to correct a negative condition.
US18/362,166 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment Pending US20230376774A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/362,166 US20230376774A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201962893324P 2019-08-29 2019-08-29
US16/998,228 US11783189B2 (en) 2019-08-29 2020-08-20 Adversarial cooperative imitation learning for dynamic treatment
US18/362,166 US20230376774A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US16/998,228 Continuation US11783189B2 (en) 2019-08-29 2020-08-20 Adversarial cooperative imitation learning for dynamic treatment

Publications (1)

Publication Number Publication Date
US20230376774A1 true US20230376774A1 (en) 2023-11-23

Family

ID=74679893

Family Applications (4)

Application Number Title Priority Date Filing Date
US16/998,228 Active 2042-04-03 US11783189B2 (en) 2019-08-29 2020-08-20 Adversarial cooperative imitation learning for dynamic treatment
US18/362,166 Pending US20230376774A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment
US18/362,125 Pending US20230376773A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment
US18/362,193 Pending US20240005163A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US16/998,228 Active 2042-04-03 US11783189B2 (en) 2019-08-29 2020-08-20 Adversarial cooperative imitation learning for dynamic treatment

Family Applications After (2)

Application Number Title Priority Date Filing Date
US18/362,125 Pending US20230376773A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment
US18/362,193 Pending US20240005163A1 (en) 2019-08-29 2023-07-31 Adversarial Cooperative Imitation Learning for Dynamic Treatment

Country Status (4)

Country Link
US (4) US11783189B2 (en)
JP (1) JP7305028B2 (en)
DE (1) DE112020004025T5 (en)
WO (1) WO2021041185A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024064953A1 (en) * 2022-09-23 2024-03-28 H. Lee Moffitt Cancer Center And Research Institute, Inc. Adaptive radiotherapy clinical decision support tool and related methods

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6410289B2 (en) 2014-03-20 2018-10-24 日本電気株式会社 Pharmaceutical adverse event extraction method and apparatus
CN113421652A (en) * 2015-06-02 2021-09-21 推想医疗科技股份有限公司 Method for analyzing medical data, method for training model and analyzer
EP3613060A1 (en) * 2017-04-20 2020-02-26 Koninklijke Philips N.V. Learning and applying contextual similarities between entities
US11266355B2 (en) * 2017-05-19 2022-03-08 Cerner Innovation, Inc. Early warning system and method for predicting patient deterioration
WO2019049819A1 (en) * 2017-09-08 2019-03-14 日本電気株式会社 Medical information processing system
KR101946402B1 (en) * 2017-10-31 2019-02-11 고려대학교산학협력단 Method and system for providing result of prospect of cancer treatment using artificial intelligence
WO2019086555A1 (en) * 2017-10-31 2019-05-09 Ge Healthcare Limited Medical system for diagnosing cognitive disease pathology and/or outcome
KR20190002059U (en) * 2018-02-05 2019-08-14 유정혜 Genetically customized drug prescription method using web Application

Also Published As

Publication number Publication date
JP7305028B2 (en) 2023-07-07
US20210065009A1 (en) 2021-03-04
JP2022542283A (en) 2022-09-30
US20230376773A1 (en) 2023-11-23
US20240005163A1 (en) 2024-01-04
US11783189B2 (en) 2023-10-10
DE112020004025T5 (en) 2022-07-21
WO2021041185A1 (en) 2021-03-04


Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: NEC LABORATORIES AMERICA, INC., NEW JERSEY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YU, WENCHAO;CHEN, HAIFENG;SIGNING DATES FROM 20200814 TO 20200815;REEL/FRAME:065133/0734