US12380360B2 - Interpretable imitation learning via prototypical option discovery for decision making - Google Patents

Interpretable imitation learning via prototypical option discovery for decision making

Info

Publication number
US12380360B2
Authority
US
United States
Prior art keywords
option
learning
options
prototypical
policy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US17/323,475
Other versions
US20210374612A1 (en)
Inventor
Wenchao Yu
Haifeng Chen
Wei Cheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Assigned to NEC LABORATORIES AMERICA, INC. Assignors: YU, Wenchao; CHEN, Haifeng; CHENG, Wei
Priority to US17/323,475 (US12380360B2)
Priority to JP2022572280A (JP7466702B2)
Priority to PCT/US2021/033107 (WO2021242585A1)
Publication of US20210374612A1
Priority to US19/230,344 (US20250299111A1)
Priority to US19/230,357 (US20250299112A1)
Assigned to NEC CORPORATION. Assignor: NEC LABORATORIES AMERICA, INC.
Publication of US12380360B2
Application granted
Legal status: Active, adjusted expiration

Classifications

    • G06N 3/08: Neural networks; Learning methods
    • G06N 20/00: Machine learning
    • G06F 16/951: Information retrieval; Indexing; Web crawling techniques
    • G06N 20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]
    • G06N 20/20: Ensemble learning
    • G06N 3/006: Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/0442: Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N 3/045: Combinations of networks
    • G06N 3/09: Supervised learning
    • G06N 3/092: Reinforcement learning
    • G06N 3/094: Adversarial learning
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks

Definitions

  • the present invention relates to imitation learning and, more particularly, to methods and systems related to interpretable imitation learning via prototypical option discovery.
  • a method for learning prototypical options for interpretable imitation learning includes initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts, applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations, learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states, training option policy with imitation learning techniques to learn a conditional policy, generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings, and taking an action based on the interpretable policies generated.
  • a non-transitory computer-readable storage medium comprising a computer-readable program for learning prototypical options for interpretable imitation learning.
  • the computer-readable program when executed on a computer causes the computer to perform the steps of initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts, applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations, learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states, training option policy with imitation learning techniques to learn a conditional policy, generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings, and taking an action based on the interpretable policies generated.
  • a method for learning prototypical options for interpretable imitation learning includes dividing a task, by a processor, into a plurality of sub-tasks via a learning policy over options, learning, by the processor, different options to solve each of the plurality of sub-tasks by mimicking expert policy, and fine-tuning the learning policy to learn to take an action based on the task.
  • FIG. 1 is a block/flow diagram of an exemplary option selection mechanism, in accordance with embodiments of the present invention
  • FIG. 2 is a block/flow diagram of an exemplary prototypical option discovery for interpretable imitation learning (IPOD) architecture, in accordance with embodiments of the present invention
  • FIG. 3 is a block/flow diagram of an exemplary method for employing the IPOD architecture of FIG. 2 , in accordance with embodiments of the present invention
  • FIG. 4 is a block/flow diagram of an exemplary method for employing the option initialization, segmentation embedding learning, prototypical option learning, and option policy learning components of FIG. 3 , in accordance with embodiments of the present invention
  • FIG. 5 is a block/flow diagram of a practical application of the IPOD architecture, in accordance with embodiments of the present invention.
  • FIG. 6 is an exemplary processing system for the IPOD architecture, in accordance with embodiments of the present invention.
  • FIG. 7 is a block/flow diagram of an exemplary method for executing the IPOD architecture, in accordance with embodiments of the present invention.
  • FIG. 8 illustrates exemplary equations for implementing the IPOD architecture, in accordance with embodiments of the present invention.
  • Imitation learning which mimics experts' behaviors is beneficial to finding meaningful structure or skills in the experts' demonstrations.
  • they are usually considered as “black-boxes” which lack transparency, limiting their application in many decision-making scenarios, e.g., healthcare and finance.
  • a variety of methods learn a hidden variable of the variation underlying expert demonstrations to construct the structure of expert policy and visualize the changes in the hidden variable.
  • post-hoc explanations do not explain the reasoning process of how the model makes its decisions and can be incomplete or inaccurate in capturing the reasoning process of the original model. Therefore, it is often desirable to have models with built-in interpretability.
  • the exemplary embodiments address such issues by defining a form of interpretability in imitation learning that imitates human abstraction and explains its reasoning in a human-understandable manner.
  • the exemplary methods employ prototype learning to discover options for built-in interpretable imitation learning.
  • Prototype learning, which derives from the study of human reasoning, is a form of case-based reasoning that makes decisions by comparing new inputs with a few data instances (prototypes), e.g., in image recognition, sequence classification, and sequence segmentation.
  • the exemplary methods discover prototypical options for interpretable imitation learning.
  • the exemplary methods introduce a network architecture referred to as prototypical option discovery (IPOD).
  • Each prototypical option is responsible for explaining a group of variable-length segments of the demonstration trajectories.
  • IPOD uses an LSTM with a soft-attention mechanism to derive segment embeddings.
  • the exemplary methods learn a prototypical contextual policy to take action with states as well as the option embedding, which is determined based on centroids of the segment embedding, as inputs.
  • the model is interpretable, in the sense that it has a transparent reasoning process when making decisions.
  • the exemplary methods define several criteria for constructing the prototypes, including option diversity and prediction accuracy.
  • the exemplary embodiments introduce an imitation learning framework that learns interpretable policy via prototypical options which include segmentation prototypes.
  • the exemplary embodiments enable learning the prototypical option embedding by weighted segmentation for sparsity and learn the prototypical option's policy by driving the option-relevant information via option embedding.
  • the goal is to learn a new policy π, which imitates the expert behavior by maximizing the likelihood of the given demonstration trajectories.
  • the behavior of an expert agent can be copied to accomplish a desired task.
  • Imitation learning refers to learning a policy that mimics the behavior of experts who demonstrate how to perform the given task.
  • Imitation learning has various approaches.
  • One approach is behavior cloning (BC), which directly maps from the state to the action. This method usually learns a policy through standard supervised learning. BC does not perform any additional policy interactions with the learning environment, but it suffers from distributional drift.
  • Another approach is inverse reinforcement learning (IRL), which learns a policy by recovering the reward function from demonstrations and with dense reward signals provided from the learned reward function.
  • Another approach is adversarial imitation learning (AIL).
  • Both AIL and IRL require interacting with the environment to generate the agent's trajectory for comparison with the expert's trajectory.
  • imitation learning with neural networks efficiently learns a desired behavior in complex environments.
  • these methods are usually considered as “black-boxes,” which lack transparency.
  • the exemplary methods introduce an interpretable imitation learning framework for more applications of imitation learning, e.g., healthcare, finance, etc.
  • An option is a generalization of an action (also known as a skill, sub-policy or a sub-goal).
  • an option is a three-tuple that includes the initiation (start) condition, the termination probability, and the policy of the option.
  • Options offer great potential for mitigating the difficulty of solving complex Markov decision processes (MDPs) via temporally extended actions.
  • Interpretable modeling mainly falls into two categories, that is, intrinsic explanation which makes the model transparent by restricting the complexity, e.g., decision tree or case-based (prototype-based) model, and post-hoc explanation, which is achieved by analyzing the model after training, e.g., extracting the importance of states via attention and distilling a black-box policy into a simple structure policy.
  • a set of post-hoc imitation learning methods has been proposed for generating meaningful policies.
  • the intrinsic explanation model is sometimes desirable since post-hoc explanations usually do not fit the original model precisely.
  • Prototype learning which draws conclusions for new inputs by comparing them with a few exemplary cases (e.g., prototypes) belongs to the intrinsic explanation method.
  • the options framework models skills as options, each of which is a closed-loop policy for solving a sub-task. For example, picking up an object or jumping are options, which require an agent to take actions over a period of time.
  • An option o includes the following components: its initiation condition, I_o(s), which determines whether o can be executed in state s; its termination condition, β_o(s), which determines whether option execution must terminate in state s; and its closed-loop control policy, π_o(s), which maps state s to a low-level action a.
  • Prototype theory emerged in 1971 with the work of psychologist Eleanor Rosch, and it has been described as a “Copernican revolution” in the theory of categorization.
  • According to prototype theory, any given concept in any given language has a real-world example that best represents it. For instance, when asked to give an example of the concept of fruit, an apple is cited more frequently than a durian. The theory claims that such presumed natural prototypes are the central tendencies of their categories.
  • Prototype theory has also been applied in machine learning, where a prototype is defined as a data instance that is representative of all the data. There are many approaches to find prototypes in the data. Any clustering algorithm that returns actual data points as cluster centers would qualify for selecting prototypes.
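The idea that any clustering algorithm returning actual data points as cluster centers can select prototypes can be sketched with a tiny k-medoids-style routine (the farthest-point initialization, function name, and toy data are illustrative, not from the patent):

```python
import numpy as np

def select_prototypes(X, k, n_iter=20):
    """k-medoids-style sketch: prototypes are actual rows of X."""
    # farthest-point initialization keeps the example deterministic
    medoids = [0]
    for _ in range(k - 1):
        d = np.min(np.linalg.norm(X[:, None] - X[np.array(medoids)][None], axis=-1), axis=1)
        medoids.append(int(d.argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # assign every point to its nearest medoid
        d = np.linalg.norm(X[:, None] - X[medoids][None], axis=-1)
        assign = d.argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(assign == c)
            if len(members) == 0:
                continue
            # move the medoid to the member minimizing intra-cluster distance
            intra = np.linalg.norm(X[members][:, None] - X[members][None], axis=-1).sum(axis=1)
            new_medoids[c] = members[intra.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

# two well-separated toy clusters; the medoids are real data rows
X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
protos = select_prototypes(X, k=2)
```

Because the returned indices point at actual data rows, every prototype is a concrete, showable example rather than an abstract centroid.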
  • a prototypical option o is a kind of option that can be presented by an instance of the trajectories generated by the experts.
  • a prototypical option o includes four components <I_o, π_o, β_o, g_o>, that is, an intra-option policy π_o: S × A → [0, 1], a termination condition β_o: S → [0, 1], an initiation state set I_o ⊆ S, and an option prototype g_o.
  • a prototypical option <I_o, π_o, β_o, g_o> is available in state s_t if and only if s_t ∈ I_o. If the option is taken, then actions are selected according to π_o until the option terminates according to β_o.
  • g o is considered as a real-world example to explain the option.
  • Options discovery is based on the intuition that it would be easier to solve the long-horizon task from temporal abstraction, e.g., separate or divide the long-horizon task into a set of sub-tasks, and select different options to solve for each sub-task.
  • This intuition informs the steps of the algorithm, that is, breaking or dividing the trajectories into a set of sub-tasks via learning a policy π_h over options, learning (or discovering) options that could solve these sub-tasks by mimicking the expert's policy, and, once such options are learned, fine-tuning π_h to learn to take an option based on the current task.
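The three steps above can be sketched on a toy problem (the 1-D corridor task, function names, and majority-action "learning" are illustrative stand-ins, not the patent's method):

```python
# Toy task: the expert walks right (+1) to state 5, then left (-1) back.
expert_traj = [(s, +1) for s in range(5)] + [(s, -1) for s in range(5, 0, -1)]

def segment_trajectory(traj, bottleneck_state=5):
    """Step 1: break the trajectory at a bottleneck state."""
    cut = next(i for i, (s, _) in enumerate(traj) if s == bottleneck_state)
    return [traj[:cut], traj[cut:]]

def learn_option_policy(segment):
    """Step 2: mimic the expert within one segment (majority action)."""
    actions = [a for _, a in segment]
    return max(set(actions), key=actions.count)

segments = segment_trajectory(expert_traj)
options = [learn_option_policy(seg) for seg in segments]

# Step 3: pi_h maps each segment's start state to its learned option;
# fine-tuning would further adjust this mapping on the full task.
pi_h = {seg[0][0]: opt for seg, opt in zip(segments, options)}
```

Here the two discovered options ("walk right", "walk left") are each tied to a concrete expert sub-trajectory, which is what makes the decomposition explainable.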
  • the goal is to first break or divide trajectories τ into M disjoint segments (g_1, g_2, . . ., g_M).
  • the exemplary embodiments leverage prototype learning to introduce an interpretable imitation learning framework by prototypical option discovery, where each prototypical option is responsible for explaining a group of variable-length segments of the demonstration trajectory.
  • IPOD 200 addresses interpretable imitation learning tasks with steps to learn prototypical options <I_o, π_o, β_o, g_o>.
  • the exemplary methods learn a policy π_h(o|s) over options to divide the trajectories into sub-tasks.
  • the exemplary methods map each segment into an option embedding.
  • the exemplary methods learn a prototypical contextual policy π(a|s, o) that takes actions based on the current state and the option embedding.
  • the set of admissible options O(s_t) is updated according to the policy π_h(o|s) over options.
  • In the IPOD 200 of FIG. 2, a policy π_h(o|s) over options is learned by choosing the admissible prototypical option. Since the exemplary methods utilize imitation learning to learn the intra-option policy, the reward of π_h(o|s) is obtained by the selected option π_o, which takes primitive actions and receives the reward signal. Thus, the reward of the option is the cumulative reward of the actions taken from the current time to the termination of the option: r_{t:t+τ} = r_t + . . . + r_{t+τ}, where τ ∈ [0, T] is the time interval of the option and t+τ is the termination of the option o_t.
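The cumulative option reward described above amounts to summing primitive rewards from the option's start until its termination; a minimal sketch (the function name and toy rewards are illustrative):

```python
def option_return(rewards, t, tau):
    """r_{t:t+tau} = r_t + ... + r_{t+tau}: the reward credited to an
    option that starts at time t and terminates at time t + tau."""
    return sum(rewards[t : t + tau + 1])

# primitive rewards collected by the intra-option policy
rewards = [1.0, 0.5, 0.5, 2.0, 0.0]
# option selected at t=1, terminating at t+tau = 3
print(option_return(rewards, t=1, tau=2))  # prints 3.0
```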
  • the option-value Q(s t , o t ) refers to the expected rewards for an action o t taken in a given state s t .
  • the above equations show how the exemplary methods can learn the policy ⁇ h over option and use it for selecting options.
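One policy-gradient update of π_h can be sketched for a tabular softmax policy (the parameterization, learning rate, and toy sizes are assumptions for illustration, not the patent's exact update):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_step(theta, state, option, q_value, lr=0.1):
    """One policy-gradient update for a tabular pi_h(o|s).

    theta[state] holds option preferences; for a softmax policy the
    gradient of log pi_h(o|s) w.r.t. the preferences is one_hot(o) - pi.
    """
    pi = softmax(theta[state])
    grad_log = -pi
    grad_log[option] += 1.0
    theta[state] += lr * q_value * grad_log
    return theta

theta = np.zeros((3, 2))          # 3 states, 2 options
theta = reinforce_step(theta, state=0, option=1, q_value=2.0)
pi_after = softmax(theta[0])      # probability of the rewarded option rises
```

Scaling the log-probability gradient by the option-value Q(s_t, o_t) pushes π_h toward options whose cumulative primitive rewards were high.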
  • the exemplary methods must assign appropriate initial parameters to ⁇ h .
  • the exemplary methods segment the trajectories by detecting the bottleneck states within the trajectories. Bottlenecks have been defined as those states which appear frequently on successful trajectories to a goal but not on unsuccessful ones or as nodes which allow for densely connected regions of the interaction graph to reach other such regions.
  • bottleneck areas have been described as the border states of densely connected areas in the state space or as states that allow transitions to a different part of the environment.
  • a more formal definition defines bottleneck areas as those states which are local maxima of betweenness, a measure of centrality on graphs, on a transition graph.
  • the exemplary methods extract all the states in the trajectories, and use density-based spatial clustering methods (e.g., DBSCAN) to automatically cluster the states into K groups.
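The density-based clustering of extracted states can be sketched with a minimal DBSCAN-style routine (a simplified stand-in; the `eps`/`min_samples` values and toy 1-D states are illustrative):

```python
import numpy as np
from collections import deque

def dbscan(X, eps=1.0, min_samples=3):
    """Minimal DBSCAN sketch (labels: -1 = noise, 0..K-1 = cluster ids)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    k = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = k                      # grow a new cluster from core point i
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = k
                if core[j]:                # only core points expand the cluster
                    queue.extend(neighbors[j])
        k += 1
    return labels

# two dense groups of states plus one isolated (noise) state
states = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2], [99.0]])
labels = dbscan(states, eps=0.5, min_samples=2)
```

The number of groups K emerges from the density structure rather than being fixed in advance, which matches the "automatically cluster the states into K groups" behavior described above.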
  • the exemplary methods aim to learn the option prototype, which is a sub-trajectory or segment generated by the experts. Each option prototype is responsible for explaining a group of variable-length segments of the demonstration trajectory g_m generated by π_h.
  • the exemplary methods map each group of segments g m,k individually into a low dimension embedding g m,k by classifying the segment into the corresponding option's category k.
  • the exemplary methods learn o k by minimizing the distance between o k and g m,k .
  • the exemplary methods consider the segment which has the smallest distance with o k as the option prototype of o k .
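The projection step above (the segment with the smallest distance to o_k becomes its prototype) can be sketched as (function name and toy embeddings are illustrative):

```python
import numpy as np

def project_prototype(o_k, segment_embeddings):
    """Project option prototype o_k onto the nearest segment embedding,
    so the prototype is always a real expert segment (interpretable)."""
    d = np.linalg.norm(segment_embeddings - o_k, axis=1)
    idx = int(d.argmin())
    return idx, segment_embeddings[idx]

segs = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
o_k = np.array([0.9, 1.2])
idx, proto = project_prototype(o_k, segs)   # nearest segment wins
```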
  • the exemplary methods aim to learn a meaningful latent space to represent the segments, where they are clustered (in L2-distance) around semantically similar prototypical options, and the clusters from different classes are well-separated.
  • the exemplary methods use a long short-term memory (LSTM) to learn the segment's representation
  • the exemplary methods minimize the distance between each segment embedding and its closest prototype o_k.
  • the exemplary methods leverage both supervised learning and imitation learning regarding the effectiveness and interpretability.
  • the exemplary methods attempt to minimize the least square loss between g and o k , and prevent the learning of multiple similar prototypical options.
  • the exemplary methods use a diversity regularization term that penalizes prototypical options that are close to each other. Meanwhile, the exemplary methods also consider the downstream task (e.g., imitation learning).
  • the first term is for effectiveness, where an imitation learning objective function is conducted to learn the segment embeddings and option prototype embeddings to mimic expert's policy ⁇ E .
  • the imitation learning loss ℒ_IM can be implemented with any imitation learning method, e.g., a behavior cloning loss or an adversarial imitation learning objective.
  • the second term is for interpretability where an evidence regularization is used to encourage each prototypical option embedding to be as close to an encoded instance as possible.
  • the third term is a diversity regularization term to learn diversified options, where d_min is a threshold that determines whether two prototypes are close. d_min is set to 1.0 in exemplary embodiments. λ_1, λ_2, λ_3 ∈ [0, 1] are the weights used to balance the three loss terms.
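The evidence and diversity regularizers described above can be sketched in numpy (the weights, the zero placeholder for the imitation loss, and the toy embeddings are illustrative assumptions, not the patent's exact objective):

```python
import numpy as np

def evidence_reg(prototypes, seg_embs):
    """Pull each prototype toward its closest encoded segment."""
    d = np.linalg.norm(prototypes[:, None] - seg_embs[None], axis=-1)
    return d.min(axis=1).sum()

def diversity_reg(prototypes, d_min=1.0):
    """Penalize pairs of prototypes closer than the threshold d_min."""
    loss = 0.0
    for i in range(len(prototypes)):
        for j in range(i + 1, len(prototypes)):
            d = np.linalg.norm(prototypes[i] - prototypes[j])
            loss += max(0.0, d_min - d) ** 2
    return loss

protos = np.array([[0.0, 0.0], [0.5, 0.0], [3.0, 3.0]])
segs = np.array([[0.1, 0.0], [3.0, 2.9]])
lam1, lam2, lam3 = 1.0, 0.5, 0.5   # weights balancing the three terms
im_loss = 0.0                      # placeholder for the imitation loss L_IM
total = lam1 * im_loss + lam2 * evidence_reg(protos, segs) + lam3 * diversity_reg(protos)
```

Only the first two prototypes are within d_min of each other, so only that pair contributes to the diversity penalty.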
  • each option o maintains its own policy π_o: s_t → a_t, which is parameterized by its own parameters θ_o.
  • the exemplary methods propose a contextual policy π_θ(a_t|s_t, o) that takes the current state and the option embedding as inputs.
  • the exemplary methods train the option policy π_θ(a_t|s_t, o) with imitation learning techniques.
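A minimal sketch of such a contextual policy, assuming a linear-softmax parameterization over the concatenated state and option embedding (the dimensions and parameterization are assumptions for illustration):

```python
import numpy as np

def contextual_policy(theta, state, option_emb):
    """pi_theta(a|s, o): one shared softmax policy conditioned on the
    concatenation of state and option embedding, instead of a separate
    set of parameters per option."""
    x = np.concatenate([state, option_emb])
    logits = theta @ x
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 5))   # 4 actions, |s| + |o| = 5 features
probs = contextual_policy(theta,
                          state=np.array([0.2, -1.0, 0.5]),
                          option_emb=np.array([1.0, 0.0]))
```

Conditioning on the option embedding lets all options share one set of policy parameters while still behaving differently per option.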
  • the goal of adversarial imitation learning is to minimize the JS divergence between trajectory distribution generated by the expert's policy and the option's policy.
  • the exemplary methods use the same policy loss for both option prototypes and option policy, but the exemplary methods only optimize the parameters of option prototypes or option policy for each optimization step.
  • w_1, w_2, w_3 ∈ [0, 1] are hyper-parameters to balance the weights of the three kinds of loss.
  • the exemplary methods first initialize K groups of segments, followed by iteratively optimizing the combined loss ℒ_option + ℒ_IL + ℒ_emb.
  • the exemplary embodiments introduce an interpretable imitation learning framework by discovering compositional structure which is called prototypical option discovery imitation learning (IPOD).
  • IPOD constructs prototypical options which embed the skills of experts by an option embedding and an option policy via a prototype learning framework.
  • IPOD generates interpretable agent policies by comparing the state segmentations to a few prototypical option embeddings followed by taking an action based on the option embedding.
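The match-then-act decision process described above can be sketched as follows (the dose labels echo the medical example of FIG. 5 and, like the function name and embeddings, are purely illustrative):

```python
import numpy as np

def interpretable_act(seg_emb, proto_embs, option_actions):
    """Pick the nearest prototypical option and return its action;
    the matched prototype index serves as the human-readable rationale."""
    d = np.linalg.norm(proto_embs - seg_emb, axis=1)
    k = int(d.argmin())
    return option_actions[k], k

protos = np.array([[0.0, 0.0], [10.0, 10.0]])
actions = ["low_dose", "high_dose"]        # illustrative option labels
action, why = interpretable_act(np.array([9.0, 9.5]), protos, actions)
```

Because `why` indexes an actual expert segment (the prototype), the agent can point to a concrete demonstration that justifies each decision.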
  • the exemplary model of the present invention uses a soft attention mechanism to derive prototypical option embedding from trajectory fragments.
  • the exemplary methods also use the soft attention mechanism to create a bottleneck in the agent, forcing it to focus on option-relevant information.
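Soft attention over per-step embeddings can be sketched as below (the hidden states and query vector are toy values; a real model would produce H with the LSTM mentioned above):

```python
import numpy as np

def soft_attention_pool(H, w):
    """Derive a single embedding from per-step hidden states H (T x d)
    via soft attention: weights = softmax(H @ w)."""
    scores = H @ w
    e = np.exp(scores - scores.max())
    alpha = e / e.sum()
    return alpha @ H, alpha

H = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])  # step embeddings
w = np.array([1.0, 0.0])                              # attention query
emb, alpha = soft_attention_pool(H, w)                # focuses on step 2
```

The attention weights act as the bottleneck: only the steps the policy attends to contribute to the option embedding, which is also what exposes the option-relevant states for inspection.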
  • FIG. 3 is a block/flow diagram of an exemplary method 300 for employing the IPOD architecture of FIG. 2 , in accordance with embodiments of the present invention.
  • option initialization takes place:
  • the IPOD first initializes the options by bottleneck state discovery methodology.
  • the exemplary methods identify states that connect different densely connected regions in the state space.
  • the exemplary methods use the behavior cloning method with soft attention mechanism to obtain important states with large attention weights.
  • the important states can then be found with DBSCAN clustering.
  • the dense clusters derived from DBSCAN are used for option initialization.
  • the policy over options learning takes place:
  • a prototypical option o includes four components <I_o, π_o, β_o, g_o>: an intra-option policy π_o: S × A → [0, 1], a termination condition β_o: S → [0, 1], an initiation state set I_o ⊆ S, and its option prototype g_o.
  • a policy π_h(o|s) is learned to choose the admissible prototypical option. Since the exemplary methods utilize imitation learning to learn the intra-option policy, the reward of π_h(o|s) is obtained by the selected option π_o, which takes primitive actions and receives the reward signal. Thus, the reward of the option is the cumulative reward of the actions taken from the current time to the termination of the option: r_{t:t+τ} = r_t + . . . + r_{t+τ}, where τ ∈ [0, T] is the time interval during which the option is ongoing, and t+τ is the termination of the option o_t.
  • the exemplary methods update π_h(o|s), taking option o_t at state s_t, according to the policy gradient: ∇J = E_{s∼π_h}[Q(s_t, o_t) ∇ log π_h(o_t|s_t)].
  • option-value Q (s t , o t ) refers to the expected rewards for an action o t taken in a given state s t .
  • the exemplary methods aim to learn the option prototype, which is a sub-trajectory or segment generated by the experts. Each option prototype is responsible for explaining a group of variable-length segments of the demonstration trajectory g_m, generated by π_h.
  • the exemplary methods map each group of segment g m,k individually into a low-dimension embedding g m,k by classifying the segment into the corresponding option's category k.
  • the exemplary methods learn o k by minimizing the distance between o k and g m,k .
  • the exemplary methods consider the segment which has the smallest distance with o k as the option prototype of o k .
  • the exemplary methods aim to learn a meaningful latent space to represent the segments, where they are clustered (in L2-distance) around semantically similar prototypical options, and the clusters from different classes are well-separated.
  • the exemplary methods minimize the distance between each segment embedding g_m and its closest prototype o_k.
  • the optimization problem to be solved combines the loss terms described below:
  • the exemplary methods leverage both supervised learning and imitation learning regarding effectiveness and interpretability.
  • the exemplary methods try to minimize the least-squares loss between g and o_k, and prevent learning multiple similar prototypical options.
  • the exemplary methods use a diversity regularization term that penalizes prototypical options that are close to each other. Meanwhile, the exemplary methods also consider the downstream task (imitation learning).
  • the second term is for interpretability where an evidence regularization is used to encourage each prototypical option embedding to be as close to an encoded instance as possible.
  • the third term is a diversity regularization to learn diversified options and d min is a threshold that classifies whether two prototypes are close or not.
  • the exemplary methods set d_min to 1.0. λ_1, λ_2, λ_3 ∈ [0, 1] are the weights used to balance the three loss terms.
  • option policy learning takes place:
  • Each option o maintains its own policy π_o: s_t → a_t, which is parameterized by its own parameters θ_o.
  • the exemplary methods propose a contextual policy π_θ(a_t|s_t, o) that takes the current state and the option embedding as inputs.
  • the exemplary methods train the option policy π_θ(a_t|s_t, o) with imitation learning techniques.
  • the goal of behavior cloning is to mimic the action of the expert at each time step via supervised learning techniques.
  • the goal of adversarial imitation learning is to minimize the JS divergence between trajectory distribution generated by the expert's policy and the option's policy.
  • option policy loss is used for both option prototypes and option policy, but the exemplary methods only optimize the parameters of option prototypes or option policy for each optimization step.
  • the exemplary methods can further train the option policy with imitation learning algorithms, e.g., behavior cloning and adversarial imitation learning.
  • the goal of option policy learning is to mimic the segmentations of demonstrations from the experts.
  • FIG. 4 is a block/flow diagram of an exemplary method for employing the option initialization, segmentation embedding learning, prototypical option learning, and option policy learning components of FIG. 3 , in accordance with embodiments of the present invention.
  • Imitation learning with neural networks efficiently learns a desired behavior in complex environments.
  • these methods are usually considered as “black-boxes” which lack transparency, limiting their application in many decision-making scenarios.
  • a variety of methods learn a hidden variable of the variation underlying expert demonstrations to construct the structure of expert policy and visualize the changes in the hidden variable.
  • post-hoc explanations do not explain the reasoning process of how the model makes its decisions and can be incomplete or inaccurate in capturing the reasoning process of the original model. Therefore, it is often desirable to have models with built-in interpretability.
  • the exemplary embodiments of the present invention define a form of interpretability in imitation learning that imitates human abstraction and explains its reasoning in a human-understandable manner.
  • the exemplary methods employ prototype learning to discover options for built-in interpretable imitation learning, which makes decisions by comparing new inputs with a few data instances (prototypes).
  • attention mechanisms and behavior cloning are utilized to extract the most important states considered while mimicking the expert's demonstration.
  • DBSCAN is used on the extracted states and the states are automatically clustered into groups.
  • imitation learning is utilized to learn the intra-option policy, where the reward is calculated by the cumulative rewards from the primitive actions.
  • prototypical options are learned via minimizing the loss of the policy and projecting the prototypes to observed states.
  • the option policy is trained with imitation learning algorithms, such as behavior cloning, inverse reinforcement learning, and adversarial imitation learning.
  • the exemplary methods introduce a new architecture, that is, prototypical option discovery for interpretable imitation learning (IPOD).
  • Each prototypical option includes a set of segmentations from experts' trajectories and is embedded by an option policy.
  • the IPOD uses a soft attention mechanism to derive prototypical option embedding from its trajectory fragments.
  • the model matches the segmentations from the demonstration to the learned prototypical options, and takes an action based on the matched prototypical option.
  • the exemplary methods also use the soft attention mechanism to create a bottleneck in the agent, forcing the agent to focus on option-relevant information. In this way, the model is interpretable, in the sense that it has a transparent reasoning process when making decisions.
  • the exemplary methods define several criteria for constructing the prototypes, including option diversity and accuracy.
  • Bottleneck state discovery segments the input trajectories into disjoint segments of variable length by, e.g., density-based clustering methods.
  • Option projection includes representation learning of the segmentations in each cluster, and prototypical option embedding learning.
  • Option refinement takes the low-level actions controlled through the prototypical option embedding and refines each group of segments by matching the segmentation embeddings to prototypical option embeddings.
  • FIG. 5 is a block/flow diagram 500 of a practical application of the IPOD architecture, in accordance with embodiments of the present invention.
  • a patient 502 needs to receive medication 504 .
  • Options are computed for indicating different levels of dosages of the medication 504 .
  • the exemplary methods learn a prototypical contextual policy π(a|s, o) to take an action based on the states as well as the option embedding.
  • the IPOD architecture 670 is implemented to enable prototypical option visualization by executing a reasoning process 555 and evaluating policy performance 557 .
  • the IPOD architecture 670, via the reasoning process 555, can smoothly compose the different options by considering the variant states 506 of the patient 502. In one instance, the IPOD architecture 670 can choose the low-dosage option for the patient 502.
  • the results 510 (e.g., dosage options) are provided as output.
  • FIG. 6 is an exemplary processing system for the IPOD architecture, in accordance with embodiments of the present invention.
  • the processing system includes at least one processor (CPU) 604 operatively coupled to other components via a system bus 602 .
  • a GPU 605 operatively coupled to the system bus 602 .
  • a Read Only Memory (ROM), a Random Access Memory (RAM), and an input/output (I/O) adapter 620 are operatively coupled to the system bus 602.
  • an interpretable imitation learning framework 670 can be employed to execute option initialization 303 , policy over options learning 305 , prototypical option learning 307 , prototypical option embedding learning 309 , and option policy learning 311 .
  • a storage device 622 is operatively coupled to system bus 602 by the I/O adapter 620 .
  • the storage device 622 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid-state magnetic device, and so forth.
  • a transceiver 632 is operatively coupled to system bus 602 by network adapter 630 .
  • User input devices 642 are operatively coupled to system bus 602 by user interface adapter 640 .
  • the user input devices 642 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present invention.
  • the user input devices 642 can be the same type of user input device or different types of user input devices.
  • the user input devices 642 are used to input and output information to and from the processing system.
  • a display device 652 is operatively coupled to system bus 602 by display adapter 650 .
  • the processing system may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements.
  • various other input devices and/or output devices can be included in the system, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art.
  • various types of wireless and/or wired input and/or output devices can be used.
  • additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art.
  • FIG. 7 is a block/flow diagram of an exemplary method for executing the IPOD architecture, in accordance with embodiments of the present invention.
  • FIG. 8 illustrates exemplary equations 800 for implementing the IPOD architecture, in accordance with embodiments of the present invention.
  • the equations include a loss function for segmentation embedding learning, an objective function, and policy losses.
  • the terms “data,” “content,” “information” and similar terms can be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure.
  • a computing device is described herein to receive data from another computing device, the data can be received directly from the another computing device or can be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • the data can be sent directly to the another computing device or can be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
  • aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” “calculator,” “device,” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
  • the computer readable medium may be a computer readable signal medium or a computer readable storage medium.
  • a computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • a computer readable storage medium may be any tangible medium that can include, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks or modules.
  • the computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks or modules.
  • processor as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
  • memory as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. Such memory may be considered a computer readable storage medium.
  • input/output devices or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, scanner, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, printer, etc.) for presenting results associated with the processing unit.


Abstract

A method for learning prototypical options for interpretable imitation learning is presented. The method includes initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts, applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations, learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states, training option policy with imitation learning techniques to learn a conditional policy, generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings, and taking an action based on the interpretable policies generated.

Description

RELATED APPLICATION INFORMATION
This application claims priority to Provisional Application Nos. 63/029,754, filed on May 26, 2020, and 63/033,304, filed on Jun. 2, 2020, the contents of both of which are incorporated herein by reference in their entirety.
BACKGROUND Technical Field
The present invention relates to imitation learning and, more particularly, to methods and systems related to interpretable imitation learning via prototypical option discovery.
Description of the Related Art
Humans have the ability to compose options or skills to solve a complex problem. For example, to treat a COVID-19 patient with a critical condition, an intensive care unit (ICU) doctor needs to compose essential skills such as endotracheal intubation, chest-tube placement, and arterial and central venous catheterization. Discovering the compositional structures from experts' trajectories is beneficial to understand the experts' policy as well as learn a new policy.
SUMMARY
A method for learning prototypical options for interpretable imitation learning is presented. The method includes initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts, applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations, learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states, training option policy with imitation learning techniques to learn a conditional policy, generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings, and taking an action based on the interpretable policies generated.
A non-transitory computer-readable storage medium comprising a computer-readable program for learning prototypical options for interpretable imitation learning is presented. The computer-readable program when executed on a computer causes the computer to perform the steps of initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts, applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations, learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states, training option policy with imitation learning techniques to learn a conditional policy, generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings, and taking an action based on the interpretable policies generated.
A method for learning prototypical options for interpretable imitation learning is presented. The method includes dividing a task, by a processor, into a plurality of sub-tasks via a learning policy over options, learning, by the processor, different options to solve each of the plurality of sub-tasks by mimicking expert policy, and fine-tuning the learning policy to learn to take an action based on the task.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
BRIEF DESCRIPTION OF DRAWINGS
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
FIG. 1 is a block/flow diagram of an exemplary option selection mechanism, in accordance with embodiments of the present invention;
FIG. 2 is a block/flow diagram of an exemplary prototypical option discovery for interpretable imitation learning (IPOD) architecture, in accordance with embodiments of the present invention;
FIG. 3 is a block/flow diagram of an exemplary method for employing the IPOD architecture of FIG. 2 , in accordance with embodiments of the present invention;
FIG. 4 is a block/flow diagram of an exemplary method for employing the option initialization, segmentation embedding learning, prototypical option learning, and option policy learning components of FIG. 3 , in accordance with embodiments of the present invention;
FIG. 5 is a block/flow diagram of a practical application of the IPOD architecture, in accordance with embodiments of the present invention;
FIG. 6 is an exemplary processing system for the IPOD architecture, in accordance with embodiments of the present invention;
FIG. 7 is a block/flow diagram of an exemplary method for executing the IPOD architecture, in accordance with embodiments of the present invention; and
FIG. 8 illustrates exemplary equations for implementing the IPOD architecture, in accordance with embodiments of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Imitation learning which mimics experts' behaviors is beneficial to finding meaningful structure or skills in the experts' demonstrations. Despite the superior performance of imitation learning models, they are usually considered as “black-boxes” which lack transparency, limiting their application in many decision-making scenarios, e.g., healthcare and finance. A variety of methods learn a hidden variable of the variation underlying expert demonstrations to construct the structure of expert policy and visualize the changes in the hidden variable. However, post-hoc explanations do not explain the reasoning process of how the model makes its decisions and can be incomplete or inaccurate in capturing the reasoning process of the original model. Therefore, it is often desirable to have models with built-in interpretability.
The exemplary embodiments address such issues by defining a form of interpretability in imitation learning that imitates human abstraction and explains its reasoning in a human-understandable manner. The exemplary methods employ prototype learning to discover options for built-in interpretable imitation learning. Prototype learning, which derives from the study of human reasoning, is a form of case-based reasoning, which makes decisions by comparing new inputs with a few data instances (prototypes) in, e.g., image recognition, sequence classification, sequence segmentation, etc.
The exemplary methods discover prototypical options for interpretable imitation learning. The exemplary methods introduce a network architecture referred to as prototypical option discovery for interpretable imitation learning (IPOD). Each prototypical option is responsible for explaining a group of variable-length segments of the demonstration trajectories. To learn the prototypical options, IPOD first learns a policy to break the trajectories into a set of segmentations, which results in K groups of segments for the K prototypical options. IPOD uses an LSTM with a soft-attention mechanism to derive segment embeddings. For each group of segments, the exemplary methods learn a prototypical contextual policy to take action with the states, as well as the option embedding, which is determined based on centroids of the segment embeddings, as inputs. In this way, the model is interpretable, in the sense that it has a transparent reasoning process when making decisions. For better interpretability, the exemplary methods define several criteria for constructing the prototypes, including option diversity and prediction accuracy.
The exemplary embodiments introduce an imitation learning framework that learns interpretable policy via prototypical options which include segmentation prototypes. The exemplary embodiments enable learning the prototypical option embedding by weighted segmentation for sparsity and learn the prototypical option's policy by driving the option-relevant information via option embedding. The goal is to learn a new policy π, which imitates the expert behavior by maximizing the likelihood of given demonstration trajectories. Thus, the behavior of an expert agent can be copied to accomplish a desired task.
Imitation learning refers to learning a policy that mimics the behavior of experts who demonstrate how to perform the given task. The behavior of the expert demonstrator is represented by trajectories τ=[s0, a0 . . . , sT, aT], which is a sequence of state action pairs. Imitation learning has various approaches. One approach is behavior cloning (BC), which directly maps from the state to the action. This method usually learns a policy through standard supervised learning. BC does not perform any additional policy interactions with the learning environment, but it suffers from distributional drift. Another approach is inverse reinforcement learning (IRL), which learns a policy by recovering the reward function from demonstrations and with dense reward signals provided from the learned reward function. However, the learned policy is valid only while the learned reward function is valid. Yet another approach is adversarial imitation learning (AIL), which constrains the behavior of the agent to be approximately optimal with an unknown reward function without explicitly attempting to recover that reward function. However, both AIL and IRL require interacting with the environment for generating the agent's trajectory for comparison with the expert's trajectory. Recently, imitation learning with neural networks efficiently learns a desired behavior in complex environments. However, these methods are usually considered as “black-boxes,” which lack transparency. The exemplary methods introduce an interpretable imitation learning framework for more applications of imitation learning, e.g., healthcare, finance, etc.
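The behavior cloning approach described above amounts to plain supervised learning over expert state-action pairs. Below is a minimal illustrative sketch, not the patent's implementation: a nearest-neighbor rule stands in for the usual learned regressor, and all names and data are hypothetical.

```python
# Minimal behavior-cloning sketch: collect expert (state, action) pairs and
# answer queries with the action of the nearest demonstrated state.

def bc_fit(demonstrations):
    """Collect all (state, action) pairs from expert trajectories."""
    dataset = []
    for trajectory in demonstrations:   # trajectory = [(s0, a0), (s1, a1), ...]
        dataset.extend(trajectory)
    return dataset

def bc_policy(dataset, state):
    """Return the action of the closest demonstrated state (squared L2)."""
    def dist(s1, s2):
        return sum((x - y) ** 2 for x, y in zip(s1, s2))
    nearest_state, nearest_action = min(dataset, key=lambda sa: dist(sa[0], state))
    return nearest_action

demos = [[((0.0, 0.0), "left"), ((1.0, 0.0), "right")],
         [((0.1, 0.1), "left")]]
data = bc_fit(demos)
action = bc_policy(data, (0.05, 0.0))   # nearest demo state was labeled "left"
```

Because the policy never interacts with the environment, errors compound at states far from the demonstrations, which is the distributional drift noted above.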
An option is a generalization of an action (also known as a skill, sub-policy, or sub-goal). Formally, an option is a three-tuple that includes the initiation condition of the option, the termination probability of the option, and the policy of the option. Options offer great potential for mitigating the difficulty of solving complex Markov decision processes (MDPs) via temporally extended actions.
Interpretable modeling mainly falls into two categories, that is, intrinsic explanation, which makes the model transparent by restricting its complexity, e.g., a decision tree or a case-based (prototype-based) model, and post-hoc explanation, which is achieved by analyzing the model after training, e.g., extracting the importance of states via attention and distilling a black-box policy into a simple structured policy. A set of post-hoc imitation learning methods has been proposed for generating meaningful policies. However, an intrinsic explanation model is sometimes desirable since post-hoc explanations usually do not fit the original model precisely. Prototype learning, which draws conclusions for new inputs by comparing them with a few exemplary cases (e.g., prototypes), belongs to the intrinsic explanation category.
The options framework models skills as options, which is a closed-loop policy to solve the sub-tasks. For example, picking up an object, jumping, etc. are options, which require a user to take actions over a period of time. An option o includes the following components, that is, its initiation condition, Io (s), which determines whether o can be executed in state s, its termination condition, βo (s), which determines whether option execution must terminate in state s and its closed-loop control policy, πo (s), which maps state s to a low-level action a.
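The three components Io(s), βo(s), and πo(s) can be bundled into one structure. The sketch below is illustrative only, with hypothetical names; a deterministic termination check stands in for the termination condition, and the toy "move right" option is an assumption, not an example from the patent.

```python
# Sketch of the options framework: an option bundles an initiation condition
# I_o(s), a termination condition beta_o(s), and a closed-loop policy pi_o(s).

class Option:
    def __init__(self, initiation, termination, policy):
        self.initiation = initiation    # I_o: state -> bool
        self.termination = termination  # beta_o: state -> bool (deterministic here)
        self.policy = policy            # pi_o: state -> low-level action

    def available(self, state):
        """An option can be executed where it initiates and has not terminated."""
        return self.initiation(state) and not self.termination(state)

# A toy "move right until x >= 3" option over integer states.
move_right = Option(
    initiation=lambda s: s < 3,
    termination=lambda s: s >= 3,
    policy=lambda s: "+1",
)
```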
Prototype theory emerged in 1971 with the work of psychologist Eleanor Rosch, and it has been described as a "Copernican revolution" in the theory of categorization. In prototype theory, any given concept in any given language has a real-world example that best represents this concept. For instance, when asked to give an example of the concept of fruits, an apple is more frequently cited than a durian. This theory claims that the presumed natural prototypes were central tendencies of the categories. Prototype theory has also been applied in machine learning, where a prototype is defined as a data instance that is representative of all the data. There are many approaches to find prototypes in the data. Any clustering algorithm that returns actual data points as cluster centers would qualify for selecting prototypes.
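As noted above, any clustering algorithm that returns actual data points as cluster centers qualifies for selecting prototypes. A minimal sketch of that idea follows (the helper name and data are assumptions): pick the cluster member closest to the cluster mean, a medoid-style choice, so the prototype is always a real instance rather than an average.

```python
# Prototype selection sketch: given one cluster of points, return the actual
# member nearest to the cluster centroid as that cluster's prototype.

def cluster_prototype(points):
    """Return the member of `points` nearest to the cluster centroid."""
    n = len(points)
    dim = len(points[0])
    centroid = tuple(sum(p[i] for p in points) / n for i in range(dim))
    return min(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, centroid)))

cluster = [(0.0, 0.0), (0.2, 0.0), (1.0, 1.0)]
proto = cluster_prototype(cluster)   # a real data point, not the centroid
```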
The exemplary embodiments introduce the formulation of the prototypical option, which is a kind of option that can be presented by an instance of the trajectories generated by the experts. A prototypical option o includes four components <Io, πo, βo, go>, that is, an intra-option policy πo: S × A → [0, 1], a termination condition βo: S → [0, 1], an initiation state set Io ⊆ S, and an option prototype go.
Specifically, go is defined by sub-trajectories generated by the experts. Given the trajectories of the expert τ={s1, a1, . . . , sT, aT}, the prototypical option is a set of segments (g1, g2, . . . , gK), where gk=s{vm′:vm} and m′=m−1. Here, vm∈[1, T] are segment boundary indicator variables with v0=0, vM=T, vm≥vm′, e.g., go=s2:4, so that go=[s2, s3, s4].
A prototypical option <Io, πo, βo, go> is available in state st if and only if st∈Io. If the option is taken, then actions are selected according to πo until the option terminates according to βo. In a prototypical option, go is considered as a real-world example to explain the option.
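The execution semantics just described, selecting actions according to πo until the option terminates according to βo, can be sketched as follows. The helper, the integer-chain environment, and all names are assumptions for illustration, not the patent's implementation.

```python
# Execution sketch for an option <I_o, pi_o, beta_o>: once taken in a state
# where I_o holds, actions follow pi_o until beta_o fires.

def run_option(state, initiation, policy, termination, step, max_steps=100):
    """Execute one option to termination; return visited states and actions."""
    assert initiation(state), "option not available in this state"
    states, actions = [state], []
    for _ in range(max_steps):
        if termination(state):
            break
        action = policy(state)
        state = step(state, action)   # environment transition
        states.append(state)
        actions.append(action)
    return states, actions

# Toy run: "increment until the state reaches 3" on an integer chain.
states, actions = run_option(
    state=0,
    initiation=lambda s: s < 3,
    policy=lambda s: 1,
    termination=lambda s: s >= 3,
    step=lambda s, a: s + a,
)
```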
Options discovery is based on the intuition that it would be easier to solve the long-horizon task from temporal abstraction, e.g., separate or divide the long-horizon task into a set of sub-tasks, and select different options to solve each sub-task. This intuition informs the steps of the algorithm, that is, breaking or dividing the trajectories into a set of sub-tasks via learning a policy πh over options, learning (or discovering) options that could solve these sub-tasks by mimicking the expert's policy, and, once such options are learned, fine-tuning πh to learn to take an option based on the current task.
Formally, given the trajectories of the expert τ={s1, a1, . . . , sT, aT}, the goal is to first break or divide the trajectories τ into M disjoint segments (g1, g2, . . . , gM), where gm=s{vm′:vm}=(s{vm′+1}, . . . , svm) and m′=m−1. Here, vm∈[1, T] are segment boundary indicator variables with v0=0, vM=T, vm≥vm′. The segments are grouped into K clusters to learn each cluster's prototypical options, where Gk={gm}, m∈{0, 1, . . . , M}, indicates the k-th group of segments.
The exemplary embodiments leverage prototype learning to introduce an interpretable imitation learning framework by prototypical option discovery, where each prototypical option is responsible for explaining a group of variable-length segments of the demonstration trajectory. As presented in FIG. 2, IPOD 200 addresses interpretable imitation learning tasks with steps to learn prototypical options <Io, πo, βo, go>. To learn the initial state set Io and the termination condition βo, the exemplary methods learn a policy πh(o|s) over options to break or divide the trajectories into a set of segmentations, which results in K groups of segments for the K prototypical options. To learn the option prototype go, the exemplary methods map each segment into an option embedding ô{vm′:vm} and cluster them to find K central nodes as option prototypes go, o={1, . . . , K}. As for learning the intra-option policy πo, the exemplary methods learn a prototypical contextual policy π(a|s, o) to take an action based on the states, as well as the option embedding.
In the options learning (Io and βo) step, πh(o|s) first constructs a set of admissible options given by O(st)={oi | Ioi(st)=1 ∩ βoi(st)=0, ∀oi∈O}. Here, O(st) is updated according to πh(o|s). IPOD 200, in FIG. 2, determines Ioi(st) and βoi(st) by the output of πh, e.g., ot, where if ot=1, then Ioi(st)=1 and βoi(st)=0; otherwise, Ioi(st)=0 and βoi(st)=1. An example of how the agent πh(o|s) selects an option is shown in structure 100 of FIG. 1.
With regard to learning the policy over options, πh(o|s) is learned by choosing the admissible prototypical option. Since the exemplary methods utilize imitation learning to learn the intra-option policy, the reward of πh(o|s) is obtained by the selected option πo, which takes primitive actions and receives the reward signal. Thus, the reward of the option is the cumulative reward of the actions taken from the current time to the termination of the option:

r{t:t+δ} = rt + . . . + r{t+δ},

where δ∈[0, T] is the time interval of the option and t+δ is the termination of the option ot.
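The cumulative option reward r{t:t+δ} above reduces to a simple sum over the primitive rewards, as in this sketch (the helper name and reward values are illustrative assumptions):

```python
# Reward sketch: the option's reward is the cumulative reward of the primitive
# actions from time t until the option terminates at t + delta (inclusive).

def option_reward(rewards, t, delta):
    """Sum primitive rewards r_t .. r_{t+delta}."""
    return sum(rewards[t:t + delta + 1])

rewards = [1.0, 0.5, 0.25, 2.0]
r = option_reward(rewards, t=1, delta=2)   # r_1 + r_2 + r_3 = 2.75
```

This summed reward is what the policy over options πh receives for the single decision of selecting the option, which is how the temporally extended action is credited.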
Given the transition (st, ot, r{t:t+δ}), the exemplary methods update πh(o|s) taking option ot at state st according to the policy gradient:

J = 𝔼{s∼πh}[Q(st, πh(ot|st))],
where the option-value Q(st, ot) refers to the expected rewards for an option ot taken in a given state st. The above equations show how the exemplary methods can learn the policy πh over options and use it for selecting options. However, before learning πh, the exemplary methods must assign appropriate initial parameters to πh. The exemplary methods segment the trajectories by detecting the bottleneck states within the trajectories. Bottlenecks have been defined as those states which appear frequently on successful trajectories to a goal but not on unsuccessful ones, or as nodes which allow densely connected regions of the interaction graph to reach other such regions. Informally, bottleneck areas have been described as the border states of densely connected areas in the state space, or as states that allow transitions to a different part of the environment. A more formal definition defines bottleneck areas as those states which are local maxima of betweenness, a measure of centrality on graphs, on a transition graph.
The exemplary methods extract all the states in the trajectories, and use density-based spatial clustering methods (e.g., DBSCAN) to automatically cluster the states into K groups. In the exemplary methods, each state group indicates one option's valid states (where Io(s)=1). That is, the initial πh will take that option while in these states, via behavior cloning.
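A minimal sketch of this initialization step, grouping extracted states by spatial proximity, follows. A greedy single-linkage grouping with a distance threshold stands in for DBSCAN here; the function name, threshold, and toy states are all assumptions for illustration.

```python
# Initialization sketch: cluster extracted states into groups so each group
# can mark one option's valid states (I_o(s) = 1). A simple threshold-based
# single-linkage grouping stands in for DBSCAN.

def cluster_states(states, eps):
    """Greedy single-linkage clustering: a state joins a cluster if it lies
    within `eps` (L2 distance) of any member; otherwise it starts a new one."""
    def close(s1, s2):
        return sum((a - b) ** 2 for a, b in zip(s1, s2)) <= eps ** 2
    clusters = []
    for s in states:
        for cluster in clusters:
            if any(close(s, member) for member in cluster):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters

states = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
groups = cluster_states(states, eps=0.5)   # two well-separated groups
```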
In option prototype learning, the exemplary methods aim to learn the option prototype, which is a sub-trajectory or segment generated by the experts. Each option prototype is responsible for explaining a group of variable-length segments of the demonstration trajectory gm generated by πh. Thus, the exemplary methods first initialize K option prototype embeddings ok∈ℝn, k={1, 2, 3, . . . , K}, as learnable parameters. Next, the exemplary methods map each group of segments gm,k individually into a low-dimension embedding gm,k by classifying the segment into the corresponding option's category k. Meanwhile, the exemplary methods learn ok by minimizing the distance between ok and gm,k. Finally, the exemplary methods consider the segment which has the smallest distance with ok as the option prototype of ok.
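The final projection step, choosing for each ok the segment with the smallest distance, can be sketched as a nearest-neighbor search over segment embeddings. The names and toy vectors below are illustrative assumptions, not the patent's implementation.

```python
# Projection sketch: ground each prototype embedding o_k in the segment
# embedding closest to it, so every learned prototype corresponds to a real
# expert sub-trajectory.

def project_prototypes(prototype_embs, segment_embs):
    """For each prototype vector, return the index of its nearest segment."""
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(segment_embs)), key=lambda m: d2(ok, segment_embs[m]))
            for ok in prototype_embs]

protos = [(0.0, 0.0), (1.0, 1.0)]
segments = [(0.1, 0.0), (0.9, 1.1), (0.5, 0.5)]
nearest = project_prototypes(protos, segments)   # segment index per prototype
```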
Regarding segmentation embedding learning, the exemplary methods aim to learn a meaningful latent space to represent the segments, where they are clustered (in L2-distance) around semantically similar prototypical options, and the clusters from different classes are well-separated.
To achieve this, the exemplary methods use a long short-term memory (LSTM) network to learn the segment's representation ĝ{vm′:vm}=fϕ(s{vm′:vm}) and the embeddings of the prototypical options ok, where s{vm′:vm} with vm=t indicates the current segment generated by πh. To force the segment embedding and the option prototypes to be in the same space, the exemplary methods minimize the distance between fϕ(s{vm′:vm}) and its closest prototype ok.
The optimization problem the exemplary methods aim to solve is:

ℒ_emb = Σ_{m=1}^{M} min_{k=1,…,K} ∥f_ϕ(s_{v_m′:v_m}) − o_k∥²₂,
The minimization of ℒ_emb encourages each training segment to have some latent patch that is close to at least one prototypical option. These terms shape the latent space into a semantically meaningful clustering structure.
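A minimal numpy sketch of this nearest-prototype loss (for illustration, the LSTM encoder f_ϕ is replaced by precomputed segment embeddings, which is an assumption):

```python
import numpy as np

def emb_loss(segment_embs, prototypes):
    """L_emb: for each segment embedding g_m, the squared L2 distance
    to its closest prototype o_k, summed over all M segments."""
    # pairwise squared distances, shape (M, K)
    d2 = ((segment_embs[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

# Hypothetical 2-D embeddings: two segments and two prototypes.
g = np.array([[0.0, 0.0], [4.0, 0.0]])
o = np.array([[1.0, 0.0], [4.0, 0.0]])

# Segment 1 is closest to prototype 1 (squared distance 1);
# segment 2 coincides with prototype 2 (squared distance 0).
loss = emb_loss(g, o)
```

Minimizing this quantity pulls each segment embedding toward its nearest prototypical option embedding, producing the clustering structure described above.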
Regarding option prototype embedding learning (go), since the option prototype embeddings {ok}, k=1, . . . , K, are representations in the latent space, they are not readily interpretable. For interpretability, the exemplary methods assign each prototypical option embedding ok, k=1, . . . , K, its closest segment embedding g in the training set.
As for learning the option prototype embeddings, the exemplary methods leverage both supervised learning and imitation learning for effectiveness and interpretability. The exemplary methods minimize the least-square loss between g and ok, and prevent the learning of multiple similar prototypical options with a diversity regularization term that penalizes prototypical options that are close to each other. Meanwhile, the exemplary methods also consider the downstream task (e.g., imitation learning).
The full objective function of option learning is given as follows:
ℒ_option = −λ1·ℒ_IL loss + λ2·Σ_{i=1}^{K} min_{m=1,…,M} ∥f_ϕ(s_{v_m′:v_m}) − e_i∥²₂ + λ3·Σ_{i=1}^{K} Σ_{j=i+1}^{K} max(0, d_min − ∥e_i − e_j∥)
where the first term is for effectiveness, where an imitation learning objective function is conducted to learn the segment embeddings and option prototype embeddings to mimic the expert's policy πE. ℒ_IL loss can be any imitation learning method, e.g., a behavior cloning loss or an adversarial imitation learning objective. The second term is for interpretability, where an evidence regularization is used to encourage each prototypical option embedding to be as close to an encoded instance as possible. The third term is a diversity regularization term to learn diversified options, where dmin is a threshold that determines whether two prototypes are close or not. dmin is set to 1.0 in exemplary embodiments. λ1, λ2, λ3 ∈ [0, 1] are the weights used to balance the three loss terms.
Regarding option policy learning πo, each option o maintains its own policy πo: s→at, which is parameterized by its own parameters θo. To reduce the parameter complexity, the exemplary methods propose a contextual policy πθ(at|st, ok) to learn a conditional policy which is conditioned on both the state and the option and is shared among all the options.
The exemplary methods train the option policy πθ(at|st, ok) via the traditional imitation learning algorithms defined as ℒ_IL loss, e.g., behavior cloning and adversarial imitation learning.
The goal of adversarial imitation learning is to minimize the JS divergence between trajectory distribution generated by the expert's policy and the option's policy.
Note that the exemplary methods use the same policy loss for both option prototypes and option policy, but the exemplary methods only optimize the parameters of option prototypes or option policy for each optimization step.
Regarding the full objective function, the loss minimized is:

ℒ_Full = w1·ℒ_option + w2·ℒ_IL loss + w3·ℒ_emb

where w1, w2, w3 ∈ [0, 1] are hyper-parameters to balance the weights of the three kinds of loss. As for optimization, the exemplary methods first initialize K groups of segments, followed by iteratively optimizing ℒ_option + ℒ_IL loss + ℒ_emb.
Therefore, the exemplary embodiments introduce an interpretable imitation learning framework by discovering compositional structure which is called prototypical option discovery imitation learning (IPOD). IPOD constructs prototypical options which embed the skills of experts by an option embedding and an option policy via a prototype learning framework. IPOD generates interpretable agent policies by comparing the state segmentations to a few prototypical option embeddings followed by taking an action based on the option embedding. Unlike seeking a minimal subset of samples as prototypes that can serve as a distillation or condensed view of a data set, the exemplary model of the present invention uses a soft attention mechanism to derive prototypical option embedding from trajectory fragments. The exemplary methods also use the soft attention mechanism to create a bottleneck in the agent, forcing it to focus on option-relevant information.
FIG. 3 is a block/flow diagram of an exemplary method 300 for employing the IPOD architecture of FIG. 2 , in accordance with embodiments of the present invention.
Prototypical option discovery for interpretable imitation learning (IPOD) proposes to learn prototypical options for interpretable imitation. Each prototypical option is responsible for explaining a group of variable-length segments of the demonstration trajectory. The exemplary methods model each group of segments by computing distances to prototypical option embedding, where prototypical option embedding is a latent variable summarizing the segments. The IPOD model includes the following learning phases.
At block 303, option initialization takes place:
The IPOD first initializes the options by bottleneck state discovery methodology. Inspired by previous works on bottleneck state discovery, e.g., frequently visited states, the exemplary methods identify states that connect different densely connected regions in the state space. In order to discover such bottleneck states from expert demonstrations, the exemplary methods use the behavior cloning method with soft attention mechanism to obtain important states with large attention weights. The important states can then be found with DBSCAN clustering. The dense clusters derived from DBSCAN are used for option initialization.
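One way to realize this attention-based state scoring (a hypothetical sketch; the patent does not specify the attention architecture) is to softmax-score each state in a trajectory against a learned query vector and keep the states with the largest weights before clustering:

```python
import numpy as np

def soft_attention_weights(state_feats, query):
    """Score each state by dot-product attention against a learned
    query vector, normalized with a softmax."""
    scores = state_feats @ query
    scores -= scores.max()          # numerical stability
    w = np.exp(scores)
    return w / w.sum()

rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))    # six states with 4-d features
query = rng.normal(size=4)          # hypothetical learned query vector

w = soft_attention_weights(states, query)
important = np.argsort(w)[-2:]      # keep the two highest-weight states
```

In the exemplary methods, the query would be trained jointly with the behavior-cloning objective, and the high-weight states would then be passed to DBSCAN for option initialization.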
At block 305, the policy over options learning takes place:
A prototypical option o includes four components ⟨Io, πo, βo, go⟩: an intra-option policy πo: S × A → [0, 1], a termination condition βo: S → [0, 1], an initiation state set Io ⊆ S, and its option prototype go. To select an option in state st, πh(o|s) first constructs a set of admissible options given by:

O(s_t) = {o_i | I_{o_i}(s_t) = 1 ∩ β_{o_i}(s_t) = 0, ∀o_i ∈ O}

Here the set O(st) is updated according to πh(o|s). IPOD determines I_{o_i}(st) and β_{o_i}(st) by the output of πh, i.e., ot, where if ot = 1, I_{o_i}(st) = 1 and β_{o_i}(st) = 0. An example of how the agent πh(o|s) selects an option is shown above with respect to O(st).
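A minimal sketch of constructing the admissible set O(st) (the indicator arrays for Io and βo below are hypothetical values for illustration):

```python
import numpy as np

def admissible_options(I, beta):
    """Options whose initiation condition holds (I=1) and whose
    termination condition does not (beta=0) in the current state."""
    return [i for i in range(len(I)) if I[i] == 1 and beta[i] == 0]

# Hypothetical indicators for K=4 options in state s_t.
I_st    = np.array([1, 1, 0, 1])   # I_{o_i}(s_t)
beta_st = np.array([0, 1, 0, 0])   # beta_{o_i}(s_t)

# Option 1 has terminated and option 2 cannot initiate here,
# so only options 0 and 3 are admissible.
O_st = admissible_options(I_st, beta_st)
```

The policy over options πh(o|s) would then choose among the members of this set.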
πh(o|s) is learned to choose the admissible prototypical option. Since the exemplary methods utilize imitation learning to learn the intra-option policy, the reward of πh(o|s) is obtained by the selected option πo, which takes primitive actions and receives the reward signal. Thus, the reward of the option is the cumulative reward of the actions taken from the current time to the termination of the option: r_{t:t+δ} = r_t + . . . + r_{t+δ}, where δ ∈ [0, T] is the time interval over which the option is on-going, and t+δ is the termination time of the option ot.
Given the transition (st, ot, r_{t:t+δ}), the exemplary methods update πh(o|s) taking option ot at state st according to the policy gradient:

J = E_{s∼πh}[Q(s, πh(ot|st))]

where the option-value Q(st, ot) refers to the expected reward for an option ot taken in a given state st.
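A hypothetical REINFORCE-style update for the policy over options (a softmax πh over K options with a linear logit model; the learning rate, state features, and return are illustrative assumptions, not values from the patent):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, state, option, option_return, lr=0.1):
    """One policy-gradient step on logits theta @ state:
    grad log pi_h(o|s) = (one_hot(o) - pi) outer state."""
    pi = softmax(theta @ state)
    one_hot = np.zeros_like(pi)
    one_hot[option] = 1.0
    grad_log = np.outer(one_hot - pi, state)
    return theta + lr * option_return * grad_log

rng = np.random.default_rng(1)
theta = np.zeros((3, 4))            # K=3 options, 4-d state features
s = rng.normal(size=4)
r = 1.0                             # cumulative option reward r_{t:t+delta}

# A positive return for option 2 should raise its selection probability.
theta = reinforce_update(theta, s, option=2, option_return=r)
probs = softmax(theta @ s)
```

With the uniform initialization, the probability of the rewarded option rises above 1/3 after a single step, which is the qualitative behavior the gradient objective J prescribes.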
At block 307, prototypical option learning takes place:
In the second stage, the exemplary methods aim to learn the option prototype, which is a sub-trajectory or segment generated by the experts. Each option prototype is responsible for explaining a group of variable-length segments of the demonstration trajectory gm generated by πh. Thus, the exemplary methods first initialize K option prototype embedding vectors ok ∈ ℝn, k ∈ {1, 2, 3, . . . , K}, as learnable parameters. Next, the exemplary methods map each group of segments gm,k individually into a low-dimension embedding gm,k by classifying the segment into the corresponding option's category k. Meanwhile, the exemplary methods learn ok by minimizing the distance between ok and gm,k. Finally, the exemplary methods consider the segment which has the smallest distance with ok as the option prototype of ok.
Regarding segmentation embedding learning, the exemplary methods aim to learn a meaningful latent space to represent the segments, where they are clustered (in L2-distance) around semantically similar prototypical options, and the clusters from different classes are well-separated.
To achieve this goal, the exemplary methods use an LSTM to learn the segment's representation g_{v_m′:v_m} = f_ϕ(s_{v_m′:v_m}) and the embeddings of the prototypical options ok, where s_{v_m′:v_m}, v_m = t, indicates the current segment generated by πh. To force the segment s_{v_m′:v_m} and the option prototypes to be in the same space, the exemplary methods minimize the distance between g_{v_m′:v_m} and its closest prototype ok. The optimization problem to be solved is:

ℒ_emb = Σ_{m=1}^{M} min_{k=1,…,K} ∥f_ϕ(s_{v_m′:v_m}) − o_k∥²₂,

The minimization of ℒ_emb encourages each training segment to have some latent patch that is close to at least one prototypical option. These terms shape the latent space into a semantically meaningful clustering structure.
At block 309, prototypical option embedding learning takes place:
Since the option prototype embeddings {ok}, k=1, . . . , K, are representations in the latent space, they are not readily interpretable. For interpretability, the exemplary methods propose to assign each prototypical option embedding ok, k=1, . . . , K, its closest segment embedding g in the training set.
As for learning the option prototype embeddings, the exemplary methods leverage both supervised learning and imitation learning for effectiveness and interpretability. The exemplary methods minimize the least-square loss between g and ok, and prevent the learning of multiple similar prototypical options with a diversity regularization term that penalizes prototypical options that are close to each other. Meanwhile, the exemplary methods also consider the downstream task (imitation learning).
The full objective function is:
ℒ_option = −λ1·ℒ_IL loss + λ2·Σ_{i=1}^{K} min_{m=1,…,M} ∥f_ϕ(s_{v_m′:v_m}) − e_i∥²₂ + λ3·Σ_{i=1}^{K} Σ_{j=i+1}^{K} max(0, d_min − ∥e_i − e_j∥)
where the first term is for effectiveness, where an imitation learning objective function is conducted to learn the segment embeddings and option prototype embeddings to mimic the expert's policy πE. The second term is for interpretability, where an evidence regularization is used to encourage each prototypical option embedding to be as close to an encoded instance as possible. The third term is a diversity regularization to learn diversified options, where dmin is a threshold that determines whether two prototypes are close or not. The exemplary methods set dmin to 1.0. λ1, λ2, λ3 ∈ [0, 1] are the weights used to balance the three loss terms.
At block 311, option policy learning takes place:
Each option o maintains its own policy πo: s→at, which is parameterized by its own parameters θo. To reduce the parameter complexity, the exemplary methods propose a contextual policy πθ(at|st, ok) to learn a conditional policy which is conditioned on both the state and the option and is shared among all the options.
The exemplary methods train the option policy πθ(at|st, ok) by traditional imitation learning algorithms, e.g., behavior cloning and adversarial imitation learning. The goal of behavior cloning is to mimic the action of the expert at each time step via supervised learning techniques. The goal of adversarial imitation learning is to minimize the JS divergence between the trajectory distribution generated by the expert's policy and that generated by the option's policy.
Note that the same policy loss is used for both option prototypes and option policy, but the exemplary methods only optimize the parameters of option prototypes or option policy for each optimization step. The exemplary methods can further train the option policy with imitation learning algorithms, e.g., behavior cloning and adversarial imitation learning. The goal of option policy learning is to mimic the segmentations of demonstrations from the experts.
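A minimal behavior-cloning sketch of the contextual option policy (a hypothetical linear policy over the concatenated state and option embedding; the patent's method would use a neural network, and the data below is synthetic for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_step(W, states, option_embs, expert_actions, lr=0.5):
    """One gradient step on the cross-entropy between
    pi_theta(a | s, o_k) and the expert's actions."""
    x = np.concatenate([states, option_embs], axis=1)  # condition on (s, o_k)
    probs = softmax(x @ W)                             # shape (N, num_actions)
    one_hot = np.eye(W.shape[1])[expert_actions]
    grad = x.T @ (probs - one_hot) / len(states)
    return W - lr * grad

rng = np.random.default_rng(2)
states = rng.normal(size=(16, 3))        # hypothetical state features
opts = rng.normal(size=(16, 2))          # hypothetical option embeddings o_k
acts = rng.integers(0, 2, size=16)       # hypothetical expert actions

W = np.zeros((5, 2))                     # linear policy parameters
for _ in range(50):
    W = bc_step(W, states, opts, acts)
```

After training, the policy's negative log-likelihood on the demonstrations falls below the log 2 of the uniform initialization, i.e., the cloned policy has moved toward the expert's action distribution.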
FIG. 4 is a block/flow diagram of an exemplary method for employing the option initialization, segmentation embedding learning, prototypical option learning, and option policy learning components of FIG. 3 , in accordance with embodiments of the present invention.
Imitation learning with neural networks efficiently learns a desired behavior in complex environments. However, these methods are usually considered "black boxes" which lack transparency, limiting their application in many decision-making scenarios. A variety of methods learn a hidden variable underlying the variation in expert demonstrations to construct the structure of the expert policy and visualize changes in the hidden variable. However, post-hoc explanations do not explain the reasoning process of how the model makes its decisions and can be incomplete or inaccurate in capturing the reasoning process of the original model. Therefore, it is often desirable to have models with built-in interpretability. The exemplary embodiments of the present invention define a form of interpretability in imitation learning that imitates human abstraction and explains its reasoning in a human-understandable manner. The exemplary methods enable prototype learning to discover options for built-in interpretable imitation learning, which makes decisions by comparing new inputs with a few data instances (prototypes).
Regarding the option initialization phase 303:
At block 401, attention mechanics and behavior cloning are utilized to extract the most important states considered while mimicking the expert's demonstration.
At block 403, for bottleneck state discovery, DBSCAN is used on the extracted states and the states are automatically clustered into groups.
Regarding policy over options learning 305:
At block 411, imitation learning is utilized to learn the intra-option policy, where the reward is calculated by the cumulative rewards from the primitive actions.
Regarding prototypical option learning 307:
At block 421, prototypical options are learned via minimizing the loss of the policy and projecting the prototypes to observed states.
Regarding prototypical option embedding learning 309:
At block 431, prototypical option embeddings are assigned their closest segment embeddings and learned via the evidence and diversity regularization terms.
Regarding option policy learning 311:
At block 441, the option policy is trained with imitation learning algorithms, such as behavior cloning, inverse imitation learning and adversarial imitation learning.
In summary, the exemplary methods introduce a new architecture, that is, prototypical option discovery for interpretable imitation learning (IPOD). Each prototypical option includes a set of segmentations from experts' trajectories and is embedded by an option policy. The IPOD uses a soft attention mechanism to derive prototypical option embeddings from trajectory fragments. Given a demonstration of the expert, the model matches the segmentations from the demonstration to the learned prototypical options, and takes an action based on the learned prototypical option. The exemplary methods also use the soft attention mechanism to create a bottleneck in the agent, forcing the agent to focus on option-relevant information. In this way, the model is interpretable, in the sense that it has a transparent reasoning process when making decisions. For better interpretability, the exemplary methods define several criteria for constructing the prototypes, including option diversity and accuracy.
The IPOD applies prototype learning to discover options for built-in interpretable imitation learning in accordance with the following, as illustrated in FIG. 2. Bottleneck state discovery segments the input trajectories into disjoint segments of variable length by, e.g., density-based clustering methods. Option projection includes representation learning of the segmentations in each cluster, and prototypical option embedding learning. Option refixation takes the low-level actions controlled through the prototypical option embedding and refines each group of segments by matching the segmentation embeddings to prototypical option embeddings.
FIG. 5 is a block/flow diagram 500 of a practical application of the IPOD architecture, in accordance with embodiments of the present invention.
In one practical example, a patient 502 needs to receive medication 504. Options are computed for indicating different levels of dosages of the medication 504. The exemplary methods learn a prototypical contextual policy π(a|s, o) to take action based on states 506. The IPOD architecture 670 is implemented to enable prototypical option visualization by executing a reasoning process 555 and evaluating policy performance 557. The IPOD architecture 670, via the reasoning process 555, can smoothly compose the different options by considering the variant states 506 of the patient 502. In one instance, the IPOD architecture 670 can choose the low-dosage option for the patient 502. The results 510 (e.g., dosage options) can be provided or displayed on a user interface 512 handled by a user 514.
FIG. 6 is an exemplary processing system for IPOD, in accordance with embodiments of the present invention.
The processing system includes at least one processor (CPU) 604 operatively coupled to other components via a system bus 602. A GPU 605, a cache 606, a Read Only Memory (ROM) 608, a Random Access Memory (RAM) 610, an input/output (I/O) adapter 620, a network adapter 630, a user interface adapter 640, and a display adapter 650, are operatively coupled to the system bus 602. Additionally, an interpretable imitation learning framework 670 can be employed to execute option initialization 303, policy over options learning 305, prototypical option learning 307, prototypical option embedding learning 309, and option policy learning 311.
A storage device 622 is operatively coupled to system bus 602 by the I/O adapter 620. The storage device 622 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid-state magnetic device, and so forth.
A transceiver 632 is operatively coupled to system bus 602 by network adapter 630.
User input devices 642 are operatively coupled to system bus 602 by user interface adapter 640. The user input devices 642 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present invention. The user input devices 642 can be the same type of user input device or different types of user input devices. The user input devices 642 are used to input and output information to and from the processing system.
A display device 652 is operatively coupled to system bus 602 by display adapter 650.
Of course, the processing system may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in the system, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system are readily contemplated by one of ordinary skill in the art given the teachings of the present invention provided herein.
FIG. 7 is a block/flow diagram of an exemplary method for executing the IPOD architecture, in accordance with embodiments of the present invention.
At block 701, initialize options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts.
At block 703, apply segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations.
At block 705, learn prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states.
At block 707, train option policy with imitation learning techniques to learn a conditional policy.
At block 709, generate interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings.
At block 711, take an action based on the interpretable policies generated.
FIG. 8 illustrates exemplary equations 800 for implementing the IPOD architecture, in accordance with embodiments of the present invention.
The equations include a loss function for segmentation embedding learning, an objective function, and policy losses.
As used herein, the terms “data,” “content,” “information” and similar terms can be used interchangeably to refer to data capable of being captured, transmitted, received, displayed and/or stored in accordance with various example embodiments. Thus, use of any such terms should not be taken to limit the spirit and scope of the disclosure. Further, where a computing device is described herein to receive data from another computing device, the data can be received directly from the another computing device or can be received indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like. Similarly, where a computing device is described herein to send data to another computing device, the data can be sent directly to the another computing device or can be sent indirectly via one or more intermediary computing devices, such as, for example, one or more servers, relays, routers, network access points, base stations, and/or the like.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” “calculator,” “device,” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical data storage device, a magnetic data storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can include, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the present invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks or modules.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks or modules.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks or modules.
It is to be appreciated that the term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other processing circuitry. It is also to be understood that the term “processor” may refer to more than one processing device and that various elements associated with a processing device may be shared by other processing devices.
The term “memory” as used herein is intended to include memory associated with a processor or CPU, such as, for example, RAM, ROM, a fixed memory device (e.g., hard drive), a removable memory device (e.g., diskette), flash memory, etc. Such memory may be considered a computer readable storage medium.
In addition, the phrase “input/output devices” or “I/O devices” as used herein is intended to include, for example, one or more input devices (e.g., keyboard, mouse, scanner, etc.) for entering data to the processing unit, and/or one or more output devices (e.g., speaker, display, printer, etc.) for presenting results associated with the processing unit.
The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

Claims (18)

What is claimed is:
1. A method for learning prototypical options for interpretable imitation learning, the method comprising:
initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts;
applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations;
learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states;
learning prototypical option embedding using an objective function:
ℒ_option = −λ1·ℒ_IL loss + λ2·Σ_{i=1}^{K} min_{m=1,…,M} ∥f_ϕ(s_{v_m′:v_m}) − e_i∥²₂ + λ3·Σ_{i=1}^{K} Σ_{j=i+1}^{K} max(0, d_min − ∥e_i − e_j∥)
where ℒ_IL loss is an imitation learning loss, f_ϕ is a segment representation function for a segment s_{v_m′:v_m} from segment v_m′ to segment v_m, e_i and e_j are embedded prototypes, K is a number of prototypes, M is a number of segments, d_min is a threshold value, and λ1, λ2, and λ3 are weighting parameters;
training option policy with imitation learning techniques to learn a conditional policy;
generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings;
generating dosage options for a patient based on the interpretable policies;
displaying the dosage options on a user interface for a user; and
taking an action based on the dosage options.
2. The method of claim 1, wherein option initialization includes identifying states from the current states that connect different densely connected regions in a state space.
3. The method of claim 2, wherein a soft attention mechanism is employed to obtain important states with particular attention weights.
4. The method of claim 3, wherein the important states are found with density-based spatial clustering of applications with noise (DBSCAN).
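As an illustration of the density-based clustering recited in claims 4 and 5, the sketch below uses a minimal pure-Python DBSCAN (a stand-in for library implementations such as scikit-learn's) to cluster expert states: points left in low-density regions between clusters are the candidate bottleneck states of claim 2. All data, names, and parameter values are hypothetical and not part of the claimed method.

```python
import numpy as np

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: label -1 marks low-density (noise) points, which
    claims 2-4 treat as candidate bottleneck states linking dense regions."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n)
                if np.linalg.norm(points[i] - points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = -1           # noise: candidate bottleneck state
            continue
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point absorbed into the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_samples:
                queue.extend(jn)     # core point: keep expanding the cluster
        cluster += 1
    return labels

# Two densely connected regions of the state space joined by one sparse
# "bridge" state, which DBSCAN flags as noise (a bottleneck candidate).
states = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],        # region A
     [5.0, 0.0],                                                          # bridge
     [10.0, 0.0], [10.1, 0.0], [10.0, 0.1], [10.1, 0.1], [10.05, 0.05]])  # region B
labels = dbscan(states, eps=0.5, min_samples=4)
bottlenecks = [i for i, lab in enumerate(labels) if lab == -1]
print(bottlenecks)  # → [5], the bridge state between the two regions
```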
5. The method of claim 1, wherein the bottleneck state discovery divides the trajectories generated by the experts into disjoint segments of variable length by a density-based clustering method.
6. The method of claim 1, wherein each of the options includes an intra-option policy, a termination condition, an initiation state set, and an option prototype.
7. The method of claim 6, wherein the option prototype is defined by a sub-trajectory generated by the experts.
8. The method of claim 1, wherein each of the one or more prototypical option embeddings is assigned with a respective closest segment embedding in a training set.
9. The method of claim 1, wherein the loss is a least square loss.
10. The method of claim 1, wherein a diversity regularization term is employed to penalize one or more of the prototypical options that are close to each other.
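The objective function recited in claim 1, including the diversity regularization term of claim 10, can be sketched numerically. The following minimal NumPy illustration assumes a precomputed imitation-learning loss value and treats the segment representations f_φ(·) as given embedding vectors; all names and numeric values are hypothetical.

```python
import numpy as np

def option_loss(seg_embs, prototypes, il_loss, lam1, lam2, lam3, d_min):
    """Claim 1 objective:
    L_option = -lam1 * L_IL
             + lam2 * sum_i min_m ||f_phi(s_{v_m:v_m'}) - e_i||_2^2
             + lam3 * sum_{i<j} max(0, d_min - ||e_i - e_j||)."""
    # Term 2: pull each prototype e_i toward its closest segment embedding.
    dists = np.linalg.norm(prototypes[:, None, :] - seg_embs[None, :, :], axis=2)
    cluster_term = (dists.min(axis=1) ** 2).sum()
    # Term 3 (claim 10): diversity regularizer penalizing prototype pairs
    # that fall closer together than the threshold d_min.
    K = len(prototypes)
    div_term = sum(max(0.0, d_min - np.linalg.norm(prototypes[i] - prototypes[j]))
                   for i in range(K) for j in range(i + 1, K))
    return -lam1 * il_loss + lam2 * cluster_term + lam3 * div_term

seg_embs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # M=3 segment embeddings
protos = np.array([[0.1, 0.0], [0.0, 0.9]])                # K=2 prototypes
loss = option_loss(seg_embs, protos, il_loss=0.5,
                   lam1=1.0, lam2=1.0, lam3=1.0, d_min=0.5)
print(round(loss, 4))  # → -0.48
```

Here each prototype lies near a distinct segment embedding, so the clustering term is small, and the two prototypes are farther apart than d_min, so the diversity penalty is zero.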
11. A non-transitory computer-readable storage medium comprising a computer-readable program for learning prototypical options for interpretable imitation learning, wherein the computer-readable program when executed on a computer causes the computer to perform the steps of:
initializing options by bottleneck state discovery, each of the options presented by an instance of trajectories generated by experts;
applying segmentation embedding learning to extract features to represent current states in segmentations by dividing the trajectories into a set of segmentations;
learning prototypical options for each segment of the set of segmentations to mimic expert policies by minimizing loss of a policy and projecting prototypes to the current states;
learning prototypical option embedding using an objective function:
$$\mathcal{L}_{\text{option}} = -\lambda_1 \cdot \mathcal{L}_{IL} + \lambda_2 \cdot \sum_{i=1}^{K} \min_{m=1,\ldots,M} \left\| f_\phi\!\left(s_{\nu_m:\nu_m'}\right) - e_i \right\|_2^2 + \lambda_3 \cdot \sum_{i=1}^{K} \sum_{j=i+1}^{K} \max\!\left(0,\; d_{\min} - \left\| e_i - e_j \right\|\right)$$
where L_IL is an imitation learning loss, f_φ is a segment representation function, s_{ν_m:ν_m'} is the segment from ν_m to ν_m', e_i and e_j are embedded prototypes, K is a number of prototypes, M is a number of segments, d_min is a threshold value, and λ1, λ2, and λ3 are weighting parameters;
training option policy with imitation learning techniques to learn a conditional policy;
generating interpretable policies by comparing the current states in the segmentations to one or more prototypical option embeddings;
generating dosage options for a patient based on the interpretable policies;
displaying the dosage options on a user interface for a user; and
taking an action based on the dosage options.
12. The non-transitory computer-readable storage medium of claim 11, wherein option initialization includes identifying states from the current states that connect different densely connected regions in a state space.
13. The non-transitory computer-readable storage medium of claim 12, wherein a soft attention mechanism is employed to obtain important states with particular attention weights.
14. The non-transitory computer-readable storage medium of claim 13, wherein the important states are found with density-based spatial clustering of applications with noise (DBSCAN).
15. The non-transitory computer-readable storage medium of claim 11, wherein the bottleneck state discovery divides the trajectories generated by the experts into disjoint segments of variable length by a density-based clustering method.
16. The non-transitory computer-readable storage medium of claim 11, wherein each of the options includes an intra-option policy, a termination condition, an initiation state set, and an option prototype.
17. The non-transitory computer-readable storage medium of claim 16, wherein the option prototype is defined by a sub-trajectory generated by the experts.
18. The non-transitory computer-readable storage medium of claim 11, wherein each of the one or more prototypical option embeddings is assigned with a respective closest segment embedding in a training set.
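Claims 8 and 18 assign each prototypical option embedding to its closest segment embedding in the training set. A minimal NumPy sketch of that projection step follows; the function and variable names are hypothetical and the data is illustrative only.

```python
import numpy as np

def project_prototypes(prototypes, seg_embs):
    """Snap every learned prototype to the nearest training-set segment
    embedding, so each prototype corresponds to an actual expert
    sub-trajectory and the resulting policy remains interpretable."""
    # Pairwise distances between prototypes (K, d) and segments (M, d).
    dists = np.linalg.norm(prototypes[:, None, :] - seg_embs[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)   # index of the closest segment per prototype
    return seg_embs[nearest], nearest

seg_embs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # training segments
protos = np.array([[0.2, 0.1], [0.1, 0.8]])                # learned prototypes
projected, idx = project_prototypes(protos, seg_embs)
print(idx.tolist())  # → [0, 2]
```

After projection, each prototype can be explained to a user by displaying the expert sub-trajectory behind its assigned segment, which is what makes the generated policies interpretable.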
US17/323,475 2020-05-26 2021-05-18 Interpretable imitation learning via prototypical option discovery for decision making Active 2044-04-29 US12380360B2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US17/323,475 US12380360B2 (en) 2020-05-26 2021-05-18 Interpretable imitation learning via prototypical option discovery for decision making
JP2022572280A JP7466702B2 (en) 2020-05-26 2021-05-19 Interpretable Imitation Learning by Discovering Prototype Options
PCT/US2021/033107 WO2021242585A1 (en) 2020-05-26 2021-05-19 Interpretable imitation learning via prototypical option discovery
US19/230,357 US20250299112A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making
US19/230,344 US20250299111A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US202063029754P 2020-05-26 2020-05-26
US202063033304P 2020-06-02 2020-06-02
US17/323,475 US12380360B2 (en) 2020-05-26 2021-05-18 Interpretable imitation learning via prototypical option discovery for decision making

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US19/230,344 Continuation US20250299111A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making
US19/230,357 Continuation US20250299112A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making

Publications (2)

Publication Number Publication Date
US20210374612A1 US20210374612A1 (en) 2021-12-02
US12380360B2 true US12380360B2 (en) 2025-08-05

Family

ID=78705053

Family Applications (3)

Application Number Title Priority Date Filing Date
US17/323,475 Active 2044-04-29 US12380360B2 (en) 2020-05-26 2021-05-18 Interpretable imitation learning via prototypical option discovery for decision making
US19/230,344 Pending US20250299111A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making
US19/230,357 Pending US20250299112A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making

Family Applications After (2)

Application Number Title Priority Date Filing Date
US19/230,344 Pending US20250299111A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making
US19/230,357 Pending US20250299112A1 (en) 2020-05-26 2025-06-06 Interpretable imitation learning via prototypical option discovery for decision making

Country Status (3)

Country Link
US (3) US12380360B2 (en)
JP (1) JP7466702B2 (en)
WO (1) WO2021242585A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230186107A1 (en) * 2021-12-14 2023-06-15 International Business Machines Corporation Boosting classification and regression tree performance with dimension reduction
CN115204387B (en) * 2022-07-21 2023-10-03 法奥意威(苏州)机器人系统有限公司 Learning methods, devices and electronic equipment under hierarchical goal conditions
JP7786689B1 (en) * 2024-11-11 2025-12-16 ソフトバンク株式会社 Information processing device, information processing method, and control program

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2872831A1 (en) * 2011-05-08 2012-11-15 Infinetics Technologies, Inc. Flexible radix switch
KR20130049201A (en) * 2010-09-30 2013-05-13 인텔 코오퍼레이션 Storage drive management
CN105393264A (en) * 2013-07-12 2016-03-09 微软技术许可有限责任公司 Interaction Segment Extraction in Human-Computer Interaction Learning
CN105893256A (en) * 2016-03-30 2016-08-24 西北工业大学 Software failure positioning method based on machine learning algorithm
JP2017142549A (en) * 2016-02-08 2017-08-17 ブレインズコンサルティング株式会社 Troubleshooting support apparatus, troubleshooting support program, and storage medium
EP3462385A1 (en) * 2017-09-28 2019-04-03 Siemens Aktiengesellschaft Sgcnn: structural graph convolutional neural network
EP2504776B1 (en) * 2009-11-24 2019-06-26 Zymeworks Inc. Density based clustering for multidimensional data
US20190324795A1 (en) * 2018-04-24 2019-10-24 Microsoft Technology Licensing, Llc Composite task execution
CN108805877B (en) * 2017-05-03 2019-11-19 西门子保健有限责任公司 Multiscale Deep Reinforcement Machine Learning for N-Dimensional Segmentation in Medical Imaging
CN110491171A (en) * 2019-09-17 2019-11-22 南京莱斯网信技术研究院有限公司 A kind of water transportation supervision early warning system and method based on machine learning techniques
WO2020162680A1 (en) * 2019-02-08 2020-08-13 아콘소프트 주식회사 Microservice system and method
CN111712862A (en) * 2018-02-14 2020-09-25 通腾运输公司 Method and system for generating traffic volume or traffic density data
US20200334093A1 (en) * 2019-04-17 2020-10-22 Microsoft Technology Licensing, Llc Pruning and prioritizing event data for analysis
CN111950950A (en) * 2019-05-17 2020-11-17 北京京东尚科信息技术有限公司 Planning method, device, computer medium and electronic device for order delivery route
WO2020235693A1 (en) * 2019-05-23 2020-11-26 国立大学法人神戸大学 Learning method, learning device, and learning program for ai agent that behaves like human
US20210295171A1 (en) * 2020-03-19 2021-09-23 Nvidia Corporation Future trajectory predictions in multi-actor environments for autonomous machine applications
CN109739585B (en) * 2018-12-29 2022-02-18 广西交通科学研究院有限公司 Spark cluster parallelization calculation-based traffic congestion point discovery method
JP7390126B2 (en) * 2019-07-31 2023-12-01 株式会社日立製作所 Trajectory data analysis system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Abbeel et al., "Apprenticeship Learning via Inverse Reinforcement Learning", Proceedings of the 21st International Conference on Machine Learning. Jul. 5-9, 2004. pp. 1-8.
Eysenbach et al., "Diversity is All You Need: Learning Skills Without a Reward Function", arXiv:1802.06070v6 [cs.AI]. Oct. 9, 2018. pp. 1-22.
Ho et al., "Generative Adversarial Imitation Learning", arXiv:1606.03476v1 [cs.LG]. Jun. 10, 2016. pp. 1-14.
Li et al., "InfoGAIL: Interpretable Imitation Learning from Visual Demonstrations", arXiv:1703.08840v2 [cs.LG]. Nov. 14, 2017. pp. 1-14.
Ming et al., "Interpretable and Steerable Sequence Learning via Prototypes", 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Aug. 4-8, 2019. pp. 1-11.
Tomar et al., "Successor Options: An Option Discovery Framework for Reinforcement Learning", arXiv:1905.05731v1 [cs.LG]. May 14, 2019. pp. 1-7.

Also Published As

Publication number Publication date
JP7466702B2 (en) 2024-04-12
WO2021242585A1 (en) 2021-12-02
US20210374612A1 (en) 2021-12-02
US20250299112A1 (en) 2025-09-25
JP2023527341A (en) 2023-06-28
US20250299111A1 (en) 2025-09-25

Similar Documents

Publication Publication Date Title
US20230153622A1 (en) Method, Apparatus, and Computing Device for Updating AI Model, and Storage Medium
US20250299112A1 (en) Interpretable imitation learning via prototypical option discovery for decision making
JP7316453B2 (en) Object recommendation method and device, computer equipment and medium
EP3782080B1 (en) Neural networks for scalable continual learning in domains with sequentially learned tasks
US11636347B2 (en) Action selection using interaction history graphs
US20240046128A1 (en) Dynamic causal discovery in imitation learning
CN106548210B (en) Credit user classification method and device based on machine learning model training
CN114616577A (en) Identifying optimal weights to improve prediction accuracy in machine learning techniques
CN114270365B (en) Clustering based on elastic centroid
US20200143498A1 (en) Intelligent career planning in a computing environment
US11176491B2 (en) Intelligent learning for explaining anomalies
CN114556331A (en) New frame for less-lens time action positioning
WO2025167876A1 (en) Object category recognition model training method and apparatus, and object category recognition method and apparatus
WO2022012347A1 (en) Predictive models having decomposable hierarchical layers configured to generate interpretable results
Lin Online semi-supervised learning in contextual bandits with episodic reward
Zhai et al. Classification of high-dimensional evolving data streams via a resource-efficient online ensemble
CN112348161B (en) Neural network training method, neural network training device and electronic equipment
Qi et al. Fedvad: Enhancing federated video anomaly detection with gpt-driven semantic distillation
Asif et al. A generalized meta-loss function for distillation based learning using privileged information for classification and regression
CN117056595A (en) An interactive project recommendation method, device and computer-readable storage medium
CN113095592A (en) Method and system for performing predictions based on GNN and training method and system
US20250181964A1 (en) Machine learning model method and system for cross-domain recommendations
US20250259072A1 (en) Automated single-to-grouped cloud computing optimization
US20250371100A1 (en) Efficient sampling for theorem proving
US20250006077A1 (en) Auto-scaling, simulated reality task training

Legal Events

Date Code Title Description
AS Assignment

Owner name: NEC LABORATORIES AMERICA, INC., NEW JERSEY

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YU, WENCHAO;CHEN, HAIFENG;CHENG, WEI;SIGNING DATES FROM 20210512 TO 20210513;REEL/FRAME:056276/0245

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

AS Assignment

Owner name: NEC CORPORATION, JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:NEC LABORATORIES AMERICA, INC.;REEL/FRAME:071486/0094

Effective date: 20250621

STCF Information on status: patent grant

Free format text: PATENTED CASE