WO2022173593A1 - Systems and methods for safe policy improvement for task oriented dialogues - Google Patents


Info

Publication number
WO2022173593A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialogue
training
task
rollouts
reward
Prior art date
Application number
PCT/US2022/014034
Other languages
French (fr)
Inventor
Govardana Sachithanandam Ramachandran
Kazuma Hashimoto
Caiming Xiong
Richard Socher
Original Assignee
Salesforce.Com, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/500,855 (published as US20220036884A1)
Application filed by Salesforce.Com, Inc.
Priority to JP2023548751A (published as JP2024507162A)
Priority to CN202280019333.1A (published as CN117136360A)
Publication of WO2022173593A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • the present disclosure relates generally to machine learning models and neural networks, and more specifically, to safe policy improvement for task-oriented dialogues.
  • Neural networks have been used to generate conversational responses and thus conduct a dialogue with a human user to fulfill a task.
  • a human user can engage in a conversation with an intelligent assistant to book travel tickets, make restaurant reservations, and/or the like.
  • the intelligent assistant usually needs to learn to collectively complete multiple subtasks.
  • the assistant needs to reserve a hotel and book a flight so that enough time is left for the commute between arrival and hotel check-in.
  • the intelligent assistant learns a dialogue policy to select among subtasks or options at a given time, which is often accompanied by a state tracker that tracks the status of the subtask.
  • Task-oriented dialogue systems are usually learnt from offline data collected using human demonstrations (e.g., past dialogues, etc.), but collecting diverse demonstrations and annotating them can be expensive.
  • offline task-oriented dialogue systems often involve disparate systems, such as a belief states tracker, dialogue policy management, response generation, etc. These disparate systems may induce stochasticity and its associated challenges in addition to the need for sample efficiency in effective dialogue policy learning.
  • FIG. 1A provides a diagram illustrating an example task-oriented dialogue described by a Markov Decision Process upon which safe policy improvement may be applied, according to one embodiment described herein.
  • FIG. 1B provides a diagram illustrating an example task-oriented dialogue of multiple dialogue turns between the user and the agent shown in FIG. 1A, according to one embodiment described herein.
  • FIG. 2 provides a simplified diagram illustrating an example architecture of training a policy network with reward learning, according to one embodiment described herein.
  • FIG. 3A provides a simplified diagram illustrating aspects of the workflow inside the reward learning module shown in FIG. 2, according to one embodiment described herein.
  • FIG. 3B provides a simplified diagram illustrating a network architecture for the reward learning module shown in FIG. 2, according to one embodiment described herein.
  • FIG. 4 is a simplified diagram of a computing device for implementing the safe policy improvement and reward learning for task-oriented dialogue, according to some embodiments.
  • FIGS. 5A-5B provide an example logic flow diagram illustrating a method of MDP-based safe policy improvement, according to an embodiment.
  • FIG. 6A provides an example pseudo-code segment illustrating an algorithm for causal aware safe policy improvement (CASPI), according to an embodiment described herein.
  • FIGS. 6B-6C provide an example logic flow diagram illustrating a method for the CASPI algorithm shown in FIG. 6A, according to an embodiment described herein.
  • FIG. 7 is a simplified block diagram illustrating a mixed human-in-the-loop and automatic evaluation metric scores for pairwise reward learning, according to embodiments described herein.
  • FIGS. 8-16 provide data charts showing example performance comparison of the safe policy improvement with existing methods, according to one embodiment.
  • Task-oriented dialogue systems are usually learnt from offline data collected using human demonstrations (e.g., past dialogues, etc.), but collecting diverse demonstrations and annotating them can be expensive.
  • offline task-oriented dialogue systems often involve disparate systems, such as a belief states tracker, dialogue policy management, response generation, etc. These disparate systems may induce stochasticity and its associated challenges in addition to the need for sample efficiency in effective dialogue policy learning.
  • Some existing systems adopt off-policy based reinforcement learning (Batch-RL) methods in solving complex tasks. Batch-RL methods usually use historically annotated data instead of a simulator, which may be sample efficient because inexpensive simulators are usually not readily available to sample data on-policy.
  • off-policy based learning may often require an estimation of behavior policy for a given state, e.g., a belief state, of the underlying Markov Decision Process (MDP).
  • a belief state does not always capture the true state of the MDP
  • the MDP latent state such as prosody, among others, may induce stochasticity in the agent response at each turn.
  • semantic information may be lost when a dialogue act is converted to natural language text.
  • the use of mere policy imitation for the dialogue act may be insufficient to reason fairly about a particular outcome if each constituent of a composite action is weighted equally.
  • a dialogue policy is trained on the dialogue rollouts generated by a latent behavior policy with a performance guarantee, e.g., by enforcing that the performance of the new policy is superior to that of the old behavior policy by at least a positive margin.
  • a training loss objective is then defined as the negative of the expected discounted sum of future reward, subject to a condition that the KL divergence between the old behavior policy and the target policy is no greater than a pre-defined hyper-parameter. In this way, the bias in training over rollouts of another policy may be much reduced, thus resulting in "safe" policy improvement.
  • pairwise causal reward learning is provided to shape a reward that reasons the intention of human utterance instead of mimicking a human demonstration in a batch reinforcement setting.
  • a combination of the safe policy improvement and the pairwise causal reward learning may achieve sample efficiency in learning complex tasks.
  • the term "network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
  • module may comprise hardware or software-based framework that performs one or more functions.
  • the module may be implemented on one or more neural networks.
  • FIG. 1A provides a diagram 100 illustrating an example task-oriented dialogue described by a Markov Decision Process upon which safe policy improvement may be applied, according to one embodiment described herein.
  • Diagram 100 shows a dialogue turn of a task-oriented dialogue between a user 110 and an intelligent agent 120.
  • the user 110 may provide a user utterance 101 "Book me a flight to London," and the agent 120 may respond with a system response "when do you want to leave?" 102.
  • the dialogue between the user 110 and the intelligent agent 120 may form a task-oriented dialogue to complete the task of planning a trip to London.
  • the task-oriented dialogue may be modeled as a Markov Decision Process (MDP), shown by the connected graph structure 110.
  • the MDP is described by the tuple {S, A, P, R, γ} of states S, actions A, transition probability P, reward R, and a discount factor γ.
  • the states S are dialogue contexts that are the agent’s interpretation of the environment.
  • Actions A are possible communication behaviors that are available to the agent at each state.
  • Transition probability P defines the probability that the states S transition to another set of states S' given the actions A.
  • the intelligent agent 120 at time step t with state st may perform a composite action at as per a target policy πθ(at | st) on the environment, with transition probabilities to the next state P(S' | S, A). For example, in the state 105 s1 after user utterance 101, the original city is confirmed (per user location), the destination city "London" is obtained from the user utterance 101, but the departure date and departure time are unknown.
  • a dialogue act 106 may be performed according to the target policy πθ(a1 | s1) to request information on the departure date, with the agent 120 replying to user 110 with the system response 102.
  • after the dialogue act 106, the dialogue state transitions from state s1 to s2.
  • a latent reward function R(s, a), with a discount factor γ, is associated with the MDP 120, defining a reward value given the set of states and actions. For example, a positive reward r 115 of "20" is assigned given the state s1 and dialogue act a1.
  • the latent reward function R(a, s) and the discount factor γ may be predefined for the MDP.
  • the latent reward function R(a, s) and the discount factor γ may be learnt through the pairwise causal reward learning mechanism described in relation to FIG. 3.
  • given the reward function and the discount factor, the objective is to optimize for the target policy πθ(at | st), which maximizes the expected discounted sum of future reward on the MDP, written as the state-action value function Q^π(st, at) = E[ Σ_{t' ≥ t} γ^(t'−t) rt' ], where rt' is the future reward at future time t', defined with the reward function R(a, s).
  • a "safe” policy improvement mechanism is described in relation to FIGS. 2 and 5.
  • FIG. 1B provides a diagram illustrating an example task-oriented dialogue of multiple dialogue turns between the user and the agent shown in FIG. 1A, according to one embodiment described herein.
  • the dialogue shown in FIG. 1B corresponds to a goal 122, e.g., relating to booking a train that departs at a certain time leaving for a certain destination.
  • the dialogue includes 4 dialogue turns, each of which includes a delexicalized user utterance 125a-d, an agent dialogue act 126a-d, and a delexicalized agent utterance/response 127a-d, respectively.
  • the 4 dialogue turns may show that the use of mere policy imitation for the dialogue act may fall short of reasoning about the outcome, instead focusing on each constituent of the composite action equally.
  • Turns 3 and 4 are rich in semantic information: Turn 3 provides the key to the transaction of the booking process, while Turn 4, which is of the least use to the success of the conversation, gets an equal weight as the other, semantically rich turns. Such specifics are lost in imitation policy learning.
  • FIG. 2 provides a simplified diagram 200 illustrating an example architecture of training a policy network with reward learning, according to one embodiment described herein.
  • Diagram 200 shows that a training dataset 210 is input to a policy network 220, and a reward learning module 260.
  • the dataset 210 includes a plurality of rollouts 212a-n from dialogues.
  • the rollouts 212a-n may be generated by human agents performing actions based on a latent stochastic behavior policy.
  • the intelligent agent does not get to interact with the environment. Instead, the set of offline data D 210 logged by human agents performing actions based on a latent stochastic behavior policy n b can be obtained.
  • the set of offline data D 210 includes a plurality of rollouts 212a-n of a dialogue, each denoted by τi.
  • each rollout τi = {(ot, at)}_{t=1..N}, where each ot is the observation at turn t, composed of (bt, u^u_t, u^a_{t-1}).
  • bt is the belief state of the agent at turn t
  • u^u_t and u^a_{t-1} are the user and agent utterances at time t and t − 1, respectively.
  • batch-RL entails training a policy on rollouts generated by the latent behaviour policy.
  • a "safe” policy improvement may be implemented, such that the new policy performance is bounded compared to the old policy.
  • the value function of the new target policy πθ and the value function of the latent behavior policy πb satisfy: Pr(V^πθ ≥ V^πb − ζ) ≥ 1 − δ, where V^πθ and V^πb are the value functions of the target policy and behavior policy, respectively.
  • 1 − δ and ζ are the high-probability and approximation meta-parameters, respectively.
  • the policy network 220 may generate a target act distribution according to a target policy πθ and the parameter θ of the policy network. Then, a stochastic loss objective Lsto(θ) may be computed at the stochastic loss module 230 for the safe policy improvement:
    Lsto(θ) = − E_{(st, at) ∼ πb} [ Q^πb(st, at) log πθ(at | st) ],  subject to D_KL(πb(· | st) ∥ πθ(· | st)) ≤ ε    (1)
  • in some implementations, the stochastic loss objective Lsto(θ) may be computed using the belief state bt to replace st in Eq. (1). The belief state is a stochastic variable as it does not capture all information. The policy πθ(bt; θ) is computed for optimizing the stochastic loss function.
  • the belief state bt is part of the observation ot at turn t that can be obtained from a specific rollout in the dataset D 210.
  • ⁇ ⁇ ( ⁇ ⁇ ) may be approximated by ⁇ ⁇ (' ⁇ ) which can be obtained from the rollouts in the dataset 210.
  • the estimation of ⁇ ⁇ ( ' ⁇ ) may be given by the number of occurrence of a dialogue act at given bt divided by the total number of act at given bt.
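  • a minimal sketch of this counting estimate of πb(at | bt) is shown below; the belief states and dialogue acts are hypothetical placeholders.

```python
from collections import Counter, defaultdict

def estimate_behavior_policy(logged_turns):
    """logged_turns: iterable of (belief_state, dialogue_act) pairs taken from the rollouts.
    Returns {belief_state: {act: empirical probability}}."""
    counts = defaultdict(Counter)
    for belief_state, act in logged_turns:
        counts[belief_state][act] += 1
    return {b: {a: n / sum(acts.values()) for a, n in acts.items()}
            for b, acts in counts.items()}

# Hypothetical logged turns: the same belief state maps to different acts.
turns = [("dest=London;date=?", "request(date)"),
         ("dest=London;date=?", "request(date)"),
         ("dest=London;date=?", "request(time)")]
print(estimate_behavior_policy(turns))
```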
  • the mode of the policy may collapse to a near deterministic action.
  • given the full observation ot, a deterministic loss objective Ldet(θ) may be computed at the deterministic loss module 240:
    Ldet(θ) = − E_{τ ∼ D} [ Σ_t ( Σ_{t' ≥ t} γ^(t'−t) R(ot', at', g; θ1) ) log πθ(at | ot) ]    (2)
    where R(s, a, g; θ1) is the reward function of the states, actions and the goal, given parameter θ1.
  • the reward R(s, a, g; θ1) and the discount factor γ(θ2) are learnt by the reward learning module 260. Hence, the combined loss module 250 computes the policy optimization loss function as:
    L(θ) = Lsto(θ) + Ldet(θ)    (3)
  • the network 220 may be trained using just the stochastic loss Lsto(θ), or just the deterministic loss Ldet(θ).
  • the network 220 is trained by the sum L(θ) of the two losses as described below.
  • the combined loss module 250 may achieve the loss function (3) via two forward passes on the policy network 220.
  • in the first pass, only the belief states {bt} from the dataset 210 are input to the policy network 220, such that the first pass captures the stochasticity of the policy conditioned only on the belief state {bt}.
  • the stochastic loss module 230 computes the stochastic loss in (1) using the action distribution output from the policy network 220.
  • in the second pass, all the observation information from the dataset 210 is input to the policy network 220 to get the action distribution for the deterministic loss module 240 to compute the deterministic loss in (2).
  • the second pass collapses the mode given other latent information of the state, such as u^u and u^a.
  • the combined loss module 250 computes the loss objective in (3), which may be used to update the policy network 220 via backpropagation. Further details of the workflow for implementing the safe policy improvement with policy network 220 can be found in relation to FIGS. 5A-5B.
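  • the two-pass computation of the loss in (3) may be organized as in the sketch below; the toy policy network, the Monte Carlo Q estimates, and the tensor shapes are illustrative assumptions rather than the configuration of policy network 220.

```python
import torch
import torch.nn as nn

class ToyPolicy(nn.Module):
    """Stand-in for a policy network mapping an encoded input to a dialogue-act distribution."""
    def __init__(self, in_dim=16, n_acts=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_acts))

    def forward(self, x):
        return torch.log_softmax(self.net(x), dim=-1)   # log pi_theta(a | input)

def combined_loss(policy, belief, full_obs, acts, q_mc, returns):
    """belief: belief-state features only (first pass); full_obs: belief state plus
    utterance features (second pass); acts: taken dialogue acts; q_mc: Monte Carlo
    Q estimates under the behavior policy; returns: discounted returns of the rollout.
    For simplicity both inputs are encoded to the same dimensionality."""
    log_pi_b = policy(belief).gather(1, acts.unsqueeze(1)).squeeze(1)    # first pass
    log_pi_o = policy(full_obs).gather(1, acts.unsqueeze(1)).squeeze(1)  # second pass
    l_sto = -(q_mc * log_pi_b).mean()       # stochastic loss, cf. Eq. (1)
    l_det = -(returns * log_pi_o).mean()    # deterministic loss, cf. Eq. (2)
    return l_sto + l_det                    # combined loss, cf. Eq. (3)

# Toy usage with random tensors (dimensions are arbitrary).
policy = ToyPolicy()
belief, full_obs = torch.randn(4, 16), torch.randn(4, 16)
acts = torch.randint(0, 8, (4,))
combined_loss(policy, belief, full_obs, acts, torch.rand(4), torch.rand(4)).backward()
```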
  • the stochastic loss objective (1) for safe policy improvement requires the Q-function of the latent behaviour policy, which can be estimated using Monte Carlo sampling on the dataset D, given the reward R(s, a, g) is known.
  • the reward learning module 260 provides a mechanism to learn a reward that is causally reasoned on the intention of the human demonstrator.
  • the reward learning module 260 provides the reward function R(s, a, g) and the discount parameter γ to the stochastic loss module 230 and the deterministic loss module 240. Further details of the reward learning module 260 are described below in relation to FIG. 3.
  • FIG. 3A provides a simplified diagram illustrating aspects of the workflow inside the reward learning module 260 shown in FIG. 2, according to one embodiment.
  • dialogue policy learning is usually accompanied by a metric M, to evaluate the performance of the learnt policy.
  • although these metrics could serve as a proxy for a reward function, directly incorporating them into reward learning can be challenging.
  • these metric functions usually return a metric score for the entire dialogue. Given the complex state-action space of the dialogue management system, the scores at the dialogue level are under-specified for rewarding an action performed at each dialogue turn.
  • a preference learning may be adapted from an online setting to an offline setting.
  • the preference learning was originally proposed in Paul et al., Feature selection as causal inference: Experiments with text classification, in Proceedings of the 21st Conference on Computational Natural Language Learning, pages 163-172, 2017.
  • the reward can be parametrized for every timestep t as r(ot, at, g).
  • Φ(·) could either be the exponential function exp(·), or the identity 1(·).
  • the probability that a rollout τ1 is preferred over a rollout τ2 may then be computed as:
    P[τ1 ≻ τ2] = Φ(R(τ1)) / ( Φ(R(τ1)) + Φ(R(τ2)) ), where R(τ) = Σ_t γ^t r(ot, at, g) is the discounted sum of the per-turn rewards of rollout τ.
  • the reward R may be optimized by minimizing a binary cross-entropy loss between the preference probability P[τ1 ≻ τ2] and the normalized metric score m(τ) between a pair of rollouts.
  • the normalized metric score is computed based on a first metric score of a first dialogue τ1 from the pair and a second metric score of a second dialogue τ2 from the pair, and both the first metric score and the second metric score are generated by the same score function. In this way, the network (with the reward) is trained to generate dialogues with performance metrics that closely reflect the preference between a rollout pair.
  • the loss objective for pairwise reward learning can be computed by:
    L(θ1, θ2) = − E_{(τ1, τ2) ∼ Dp} [ m(τ1) log P[τ1 ≻ τ2] + m(τ2) log P[τ2 ≻ τ1] ]    (4)
    where m(τi) = M(τi) / ( M(τ1) + M(τ2) ) is the normalized metric score, and
  • θ1 and θ2 correspond to the parameters for the reward R(a, s, g; θ1) and the discount factor γ(θ2), respectively.
  • the discount factor γ may be pre-defined, or learnt during training.
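  • a minimal sketch of the pairwise loss in (4) is given below: per-turn rewards are summed into a global reward per rollout, turned into a preference probability, and regressed against the normalized metric scores with binary cross-entropy; discounting is omitted for brevity, and the reward values here are placeholders for the output of R(a, s, g; θ1).

```python
import torch

def pairwise_reward_loss(r_tau1, r_tau2, metric1, metric2, eps=1e-8):
    """r_tau1, r_tau2: (batch, turns) per-turn rewards predicted for the two rollouts of a pair.
    metric1, metric2: (batch,) dialogue-level metric scores M(tau) for the same pairs."""
    R1, R2 = r_tau1.sum(dim=1), r_tau2.sum(dim=1)                   # global reward per rollout
    p1 = torch.softmax(torch.stack([R1, R2], dim=1), dim=1)[:, 0]   # P[tau1 > tau2]
    m1 = metric1 / (metric1 + metric2 + eps)                        # normalized metric score
    # Binary cross-entropy between the preference probability and the normalized metric.
    return -(m1 * torch.log(p1 + eps) + (1 - m1) * torch.log(1 - p1 + eps)).mean()

# Toy usage: in practice the per-turn rewards come from the reward model R(a, s, g; theta_1).
r1 = torch.rand(4, 5, requires_grad=True)
r2 = torch.rand(4, 5, requires_grad=True)
loss = pairwise_reward_loss(r1, r2, torch.rand(4), torch.rand(4))
loss.backward()
```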
  • the reward learning module 260 receives and splits the dataset D into K-fold training and validation subsets 261.
  • the dataset 210 is partitioned into complementary subsets 261, performing training on one subset, and validating the trained network on another (test) subset.
  • K baseline models 262a-n are trained based on a cross-entropy loss (instead of (3)) using the K training subsets.
  • the trained K baseline models 262a-n are used to predict on the corresponding validation subsets, and each baseline model may be similar to the neural model used by the policy network 220.
  • the predicted action distribution from the K-baseline models are used to generate output dialogues 264a-n, each of which is scored by a chosen metric 263.
  • a pair of dialogues from the predicted dialogues 264a-n with corresponding score functions may be used to compute the pairwise reward loss (4) at the pairwise causal reward learning module 265.
  • the pairwise reward loss (4) may then be used to backpropagate a neural network to update the parameters θ1, θ2.
  • the pairwise causal reward learning module 265 outputs the reward function R(a, s, g; θ1) and the discount factor γ(θ2).
  • the neural network for the pairwise causal reward learning module 265 may be a single bi-LSTM layer that embeds the action, state and goal, followed by a couple of multilayer perceptron (MLP) layers.
  • with the learnt reward, the policy parameter θ can then be updated by a gradient step weighted by the reward, e.g.:
    θ ← θ + α Σ_t R(st, at, g; θ1) ∇θ log πθ(at | ot)    (6)
  • the learnt reward is akin to sample weights for each instance of the data, which helps to redistribute the gradient update budget among the samples based on their contribution to the overall success of the task-oriented dialogue (ToD) system.
  • the learnt reward may be used as a sample weight in any existing ToD dialogue system to reap the benefit of the sample efficiency it brings.
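  • a sketch of using the learnt reward as a per-sample weight on an existing ToD system's supervised loss is shown below; the tensor shapes and the way the rewards are obtained are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def reward_weighted_ce(logits, target_acts, turn_rewards):
    """logits: (B, T, A) act logits from any ToD policy; target_acts: (B, T) ground-truth acts;
    turn_rewards: (B, T) learnt rewards R(s_t, a_t, g) used as per-turn sample weights."""
    ce = F.cross_entropy(logits.flatten(0, 1), target_acts.flatten(), reduction="none")
    return (turn_rewards.flatten() * ce).mean()  # gradient budget follows the learnt reward

# Toy usage with arbitrary sizes.
logits = torch.randn(2, 3, 8, requires_grad=True)
targets = torch.randint(0, 8, (2, 3))
weights = torch.rand(2, 3)
reward_weighted_ce(logits, targets, weights).backward()
```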
  • the dialogue roll-outs are generated by expert latent policy.
  • the data (dialogue rollouts) may be distributed as per the optimal latent policy and transition probability.
  • the process of learning a policy that maximizes the likelihood of the data may be a curriculum for exploring the state-action space for the pairwise reward learning objective (5).
  • the process of fitting a maximum likelihood (MLE) policy may induce useful perturbations through the stochasticity of the optimizer.
  • the pairs of learnt rollouts with the corresponding metric scores may be used to train the preferential optimization (5), which in turn learns the fine-grained reward R(a, s, g; θ1).
  • FIG. 3B provides a simplified diagram illustrating a network architecture 300 for the reward learning module 260 shown in FIG. 2, according to one embodiment described herein.
  • three single bi-LSTM layers are each used to encode the goal, belief state, and dialogue act or response sequences at each dialogue turn of each of the sampled rollout pairs.
  • the bi-LSTM layer 301a is used to encode the goal of the sampled predicted rollout τ1
  • the bi-LSTM layer 302a is used to encode the belief state of each dialogue turn of rollout τ1
  • the bi-LSTM layer 303a is used to encode the dialogue act of each dialogue turn of rollout τ1.
  • the bi-LSTM layer 301b is used to encode the goal of the sampled predicted rollout τ2; the bi-LSTM layer 302b is used to encode the belief state of each dialogue turn of rollout τ2; and the bi-LSTM layer 303b is used to encode the dialogue act of each dialogue turn of rollout τ2.
  • the three bi-LSTM layers can be used to encode both the rollout τ1 and the rollout τ2.
  • two sets of parallel bi-LSTM layers 301a, 302a, and 303a, and 301b, 302b and 303b may be used to encode the pair of sampled rollouts, respectively in parallel.
  • the three encoded representations from bi-LSTM layers 301a, 302a, and 303a are concatenated, at 305a.
  • the three encoded representations from bi-LSTM layers 301b, 302b, and 303b are concatenated, at 305b.
  • the concatenated representation is then fed through a couple of feed-forward layers before making a bounded per-turn reward prediction for each turn of the rollout τ1 or τ2 using a sigmoid function.
  • the per-turn rewards are summed over all turns of each rollout to form a global reward R(τ1) or R(τ2) for the pair of rollouts.
  • the probabilistic preference between the rollouts can then be computed as P[τ1 ≻ τ2] = f(R(τ1)) / ( f(R(τ1)) + f(R(τ2)) ), where the f(·) function may be standard normalization or a softmax function.
  • the output 307 of this preference probability may be optimized using a cross entropy loss described in Eqn. (4).
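  • a compact sketch of such a pairwise reward network is given below, with three bi-LSTM encoders, concatenation, feed-forward layers with a sigmoid for a bounded per-turn reward, summation over turns, and a softmax preference over the pair; the vocabulary handling, embedding sizes and layer widths are assumptions rather than the disclosed configuration.

```python
import torch
import torch.nn as nn

class PairwiseRewardModel(nn.Module):
    def __init__(self, vocab=500, emb=32, hid=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        # One bi-LSTM each for goal, belief state, and dialogue act/response tokens.
        self.enc_goal = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.enc_belief = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.enc_act = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(6 * hid, hid), nn.ReLU(),
                                  nn.Linear(hid, 1), nn.Sigmoid())  # bounded per-turn reward

    def _last(self, lstm, tokens):
        out, _ = lstm(self.embed(tokens))
        return out[:, -1, :]                       # last-step representation

    def turn_reward(self, goal, belief, act):
        feats = torch.cat([self._last(self.enc_goal, goal),
                           self._last(self.enc_belief, belief),
                           self._last(self.enc_act, act)], dim=-1)
        return self.head(feats).squeeze(-1)        # r(o_t, a_t, g) in (0, 1)

    def preference(self, turns_tau1, turns_tau2):
        # Each rollout is a list of (goal, belief, act) token tensors, one tuple per turn.
        R1 = torch.stack([self.turn_reward(*t) for t in turns_tau1]).sum(dim=0)
        R2 = torch.stack([self.turn_reward(*t) for t in turns_tau2]).sum(dim=0)
        return torch.softmax(torch.stack([R1, R2], dim=-1), dim=-1)[..., 0]  # P[tau1 > tau2]

# Toy usage: two 2-turn rollouts with random token ids (batch of 1, sequence length 4).
m = PairwiseRewardModel()
mk = lambda: tuple(torch.randint(0, 500, (1, 4)) for _ in range(3))
print(m.preference([mk(), mk()], [mk(), mk()]))
```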
  • FIG. 4 is a simplified diagram of a computing device for implementing the safe policy improvement and reward learning for task-oriented dialogue, according to some embodiments.
  • computing device 400 includes a processor 410 coupled to memory 420. Operation of computing device 400 is controlled by processor 410.
  • processor 410 may be representative of one or more central processing units, multicore processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 400.
  • Computing device 400 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
  • Memory 420 may be used to store software executed by computing device 400 and/or one or more data structures used during operation of computing device 400.
  • Memory 420 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH- EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
  • Processor 410 and/or memory 420 may be arranged in any suitable physical arrangement.
  • processor 410 and/or memory 420 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like.
  • processor 410 and/or memory 420 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 410 and/or memory 420 may be located in one or more data centers and/or cloud computing facilities.
  • memory 420 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 410) may cause the one or more processors to perform the methods described in further detail herein.
  • memory 420 includes instructions for a safe policy improvement module 430 and a reward learning module 435 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein.
  • the safe policy improvement module 430 and the reward learning module 435 receives an input 440 via a data interface 415 and may generate an output 450.
  • the input 440 may include a training dataset 210 as shown in FIGS. 2-3.
  • the data interface 415 may include a communication interface that receives the dataset input 440 from a remote database via a communication network.
  • the data interface 415 may include a user interface via which a user may select and load the dataset input 440 to the processor 410.
  • the output 450 may include an action distribution for a dialogue, an optimized policy, and/or the like.
  • the safe policy improvement module 430 may comprise a policy network 220, a stochastic loss module 230, a deterministic loss module 240, and a combined loss module 250 shown in FIG. 2.
  • the reward learning module 435 may be similar to module 260 shown in FIG. 2, which is further detailed in FIG. 3.
  • the reward learning module 435 as described in relation to FIG. 3, may comprise K-base models 262a-n and a pairwise causal reward learning module 265.
  • FIGS. 5A-5B provide an example logic flow diagram illustrating a method 500 of MDP-based safe policy improvement, according to an embodiment.
  • One or more of the processes 502-524 of method 500 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 502-524.
  • method 500 may correspond to the method used by the module 430.
  • a training dataset (e.g., dataset 210) comprising a plurality of dialogue rollouts (e.g., rollouts 212a-n) generated by a latent stochastic behavior policy is received.
  • Each rollout includes a time series of observations representing information of a respective dialogue at a plurality of dialogue turns.
  • belief states obtained from the time series of observations are input to a neural model (e.g., policy network 220) in a first pass.
  • a first predicted action distribution is generated based on a current state of the respective dialogue according to a target policy.
  • a first discounted sum of future reward is computed based on a discount parameter and a reward function of the actions and states of the respective dialogue according to the latent behavior policy, conditioned on a belief state obtained from the time series of observations.
  • a first loss objective is computed based on a first expectation of the first discounted sum of future reward and the first predicted action distribution. Specifically, the first expectation is taken over a probability distribution of the states and the actions according to the latent stochastic behavior policy, e.g., according to (1).
  • the full observations are input to the neural model in a second pass.
  • all the observation information from the dataset 210 is input to the policy network 220.
  • a second predicted action distribution is generated based on a current observation from the time series of observations according to the target policy. For example, the action distribution πθ(ot) is generated.
  • a second discounted sum of future reward based on the discount parameter and the reward function for a specific rollout is computed. Specifically, the second discounted sum of future reward is a collapsed near-deterministic approximation of the first discounted sum of future reward.
  • a second loss objective is computed based on a second expectation of the second discounted sum of future reward and the second predicted action distribution. Specifically, the second expectation is taken over an average of the observations across the training dataset.
  • the second loss objective is computed by the deterministic loss module 240 according to (2).
  • a combined loss objective is computed by summing the first loss objective and the second loss objective, e.g., according to (3).
  • the neural model is updated based on the combined loss objective, subject to a condition that a KL-divergence between the latent stochastic behavior policy and the target policy conditioned on the current state of the respective dialogue is less than a pre-defined hyperparameter.
  • FIG. 6A provides an example pseudo-code segment illustrating an algorithm for causal aware safe policy improvement (CASPI), according to an embodiment described herein.
  • the (train) dataset is subsampled into K-fold training DT and validation sets Dv.
  • K baseline models are trained to fit the data distribution generated by experts using a cross-entropy loss.
  • the still-learning K policies are used to predict on their corresponding K-fold validation subsets at every epoch of the training.
  • each of the dialogues is scored by the chosen dialogue-level metric during the training.
  • pairs of dialogue predictions generated by the above process, along with their corresponding metric scores, are used to train the preferential optimization objective.
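  • the data-generation portion of this procedure may be organized as in the sketch below, which builds the set of scored dialogue predictions consumed by the preferential optimization and policy optimization sketches above; the baseline model, prediction and metric functions are hypothetical stand-ins.

```python
import random

def kfold_split(data, k):
    """Split a list of dialogues into k (train, validation) pairs."""
    folds = [data[i::k] for i in range(k)]
    return [(sum(folds[:i] + folds[i + 1:], []), folds[i]) for i in range(k)]

def predict(model, dialogue):
    """Placeholder for a baseline policy's predicted rollout on a held-out dialogue."""
    return {"model": model, "dialogue": dialogue}

def score_dialogue(pred):
    """Placeholder for the chosen dialogue-level metric (e.g., inform/success based)."""
    return random.random()

def build_pairwise_set(dataset, k=5, epochs=3):
    """Generate D_p: predicted rollouts from still-learning baselines plus metric scores."""
    pairwise_set = []
    for fold_id, (train_split, val_split) in enumerate(kfold_split(dataset, k)):
        model = f"baseline-{fold_id}"          # placeholder for a K-fold baseline policy
        for epoch in range(epochs):            # the baseline keeps training across epochs
            for dialogue in val_split:
                pred = predict(model, dialogue)
                pairwise_set.append((pred, score_dialogue(pred)))
    return pairwise_set

pairs = build_pairwise_set(list(range(20)))
tau1, tau2 = random.sample(pairs, 2)           # a sampled pair for the pairwise loss
print(len(pairs), tau1[1], tau2[1])
```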
  • FIGS. 6B-6C provide an example logic flow diagram illustrating a method for the CASPI algorithm shown in FIG. 6A, according to an embodiment described herein.
  • One or more of the processes 602-626 of method 600 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 602-626.
  • method 600 may correspond to the method used by the modules 430 and 435.
  • a training dataset (e.g., dataset 210) comprising a plurality of dialogue rollouts (e.g., rollouts 212a-n) generated by a latent stochastic behavior policy is received.
  • the training dataset is repeatedly sampled for a number of times to generate a number of training subsets and a number of validation subsets.
  • the dataset D is split into K-fold training subsets DT and validation subsets DV 261.
  • the dataset 210 is partitioned into complementary subsets 261, performing training on one subset, and validating the trained network on another (test) subset.
  • a task-oriented dialogue model is trained based on a cross-entropy loss using training data in a first training subset of the number of training subsets. For example, a dataset is retrieved from the number of training subsets or the number of validation subsets {DT, DV}, and the task-oriented dialogue model is updated by minimizing a cross-entropy loss of a predicted dialogue action conditioned on a current state of a dialogue according to a target policy using dialogue data from the retrieved dataset.
  • the cross-entropy loss can be expressed as:
    LCE = − E_{(s, a) ∼ DT} [ log πm(a | s) ]
    where πm(a | s) denotes the probability of the predicted dialogue action a according to the policy πm conditioned on the dialogue state s.
  • the task-oriented dialogue model generates predicted dialogue rollouts from dialogue data in a first validation subset of the number of validation subsets.
  • at step 610, the predicted dialogue rollouts are added to a pairwise causal learning subset Dp. From step 612, steps 608-610 may be repeated if there is another training epoch. If there is no other training epoch at step 612, method 600 may determine whether there is another dataset in {DT, DV} at step 616. If there is another dataset, method 600 proceeds to repeat from step 606 with another dataset. If there is no other dataset, method 600 proceeds to step 618.
  • a pair of dialogue rollouts may be sampled from the pairwise causal learning subset.
  • the task-oriented dialogue model may be trained based on a binary cross-entropy loss between a preference probability between the pair of dialogue rollouts and a normalized metric score based on the pair of dialogue rollouts.
  • step 620 may be illustrated by the process flow described in relation to FIG. 3B.
  • method 600 determines whether training convergence has been reached using data Dp. If not, method 600 repeats from step 618 by re-sampling another pair of dialogue rollouts. If convergence has been reached using data Dp, method 600 proceeds to step 624.
  • the task-oriented dialogue model may be trained based on a policy optimization loss that optimizes over the target policy using the training dataset.
  • for example, the optimization over the target policy is discussed in relation to method 500 in FIGS. 5A-5B.
  • method 600 determines whether training convergence has been reached using data D. If not, method 600 repeats from step 624. If convergence has been reached using data D, method 600 may end.
  • FIG. 7 is a simplified block diagram illustrating a mixed human-in-the-loop and automatic evaluation metric scores for pairwise reward learning, according to embodiments described herein.
  • Automatic evaluation metrics have their own biases. The true objective of a ToD system is the human experience while interacting with the dialogue system, which automatic evaluation metrics might fall short of capturing. To this end, human evaluation may be conducted on the quality of the generated response. Quality can be defined by the following criteria: (a) Appropriateness, e.g., are the generated responses appropriate for the given context in the dialogue turn? (b) Fluency, e.g., are the generated responses coherent and comprehensible?
  • the ToD model is then trained for reward R(s, a, g) using pairwise causal reward learning as described in relation to FIGS. 6A-6C, where examples of the mini batch are randomly sampled either from human scored examples 730 or the ones scored by the automatic evaluation metric 740.
  • FIGS. 1A-7 relate to dialogue policy learning. However, similar embodiments can be applied to different tasks in similar settings, such as but not limited to end-to-end dialogue system training (e.g., dialogue state tracker, dialogue policy and response generation, etc.), and/or the like.
  • the training dataset (e.g., 210) can be the MultiWoz2.0 dataset, a multi-turn multi-domain dataset spanning seven domains, including attraction, hospital, hotel, police, restaurant, taxi, train, and an additional domain for general greeting.
  • the dataset is created from real human conversation, between a tourist and a clerk at an information center.
  • Each dialogue is generated by users with a defined goal which may cover 1-5 domains with a maximum of 13 turns in a conversation.
  • the dataset has 10438 dialogues split into 8438 dialogues for training set and 1000 dialogues each for validation and test set.
  • the policy network 220 and/or the reward learning network 260 may adopt a neural model proposed in Zhang et al, Task-oriented dialog systems that consider multiple appropriate responses under the same context, arXiv preprint arXiv: 1911.10484, 2019 as the baseline (referred to as "DAMD”).
  • for the CASPI adaptation of DAMD, a single bi-LSTM layer to embed the action, state and goal, followed by a couple of MLP layers, may be used for reward learning.
  • DAMD is composed of three seq2seq generative models using GRUs. The three seq2seq models are one each for the belief state, dialogue act and response generation modules.
  • An attention layer is then used to attend the outputs of the seq2seq models with the context vector of previous turn for copy over mechanism.
  • the outputs are then used as representation for predicting series of tokens for their respective modules.
  • both the stochastic loss Lsto and the deterministic loss Ldet are used on the dialogue act.
  • the cross-entropy loss is used as-is from DAMD.
  • another model with more complexity is the task-oriented dialogue model MinTL, described in Lin et al., Mintl: Minimalist transfer learning for task-oriented dialogue systems, arXiv preprint arXiv:2009.12005, 2020.
  • MinTL uses the large pretrained language model BART, which is a standard encoder-decoder transformer architecture with a bidirectional encoder and an autoregressive decoder. It is pre-trained on the task of denoising corrupted documents. BART is trained using a cross-entropy loss between the decoder output and the original document. MinTL does not explicitly predict the dialogue act; hence the deterministic loss Ldet is used directly on the generated response, and for DST the loss is retained as-is from MinTL.
  • database results are represented as one-hot vectors.
  • domain-adaptive delexicalization preprocessing is adopted, and delexicalized responses are generated with placeholders for specific values which can be filled according to the current utterance that refers to some slot values offered by the system in the previous turn.
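  • as an illustration of delexicalized responses with placeholders, the toy sketch below fills slot placeholders from a dictionary of slot values; the placeholder naming convention is an assumption rather than the preprocessing used in the disclosure.

```python
import re

def lexicalize(delex_response, slot_values):
    """Replace [slot] placeholders in a delexicalized response with concrete values."""
    return re.sub(r"\[(\w+)\]",
                  lambda m: slot_values.get(m.group(1), m.group(0)),
                  delex_response)

# Hypothetical delexicalized response and slot values.
template = "there are [choice] trains leaving [departure] after [leaveat] ."
print(lexicalize(template, {"choice": "12", "departure": "cambridge", "leaveat": "09:00"}))
# -> "there are 12 trains leaving cambridge after 09:00 ."
```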
  • the context-to-response generation task of MultiWoz2.0 may be implemented and the corresponding evaluation metrics are used to measure the quality of the response.
  • evaluation metrics include the inform rate, which measures the fraction of dialogues in which the system has provided the requested information; the success rate, which measures the fraction of dialogues in which the system has answered all the requested information; and BLEU, which is used to measure the fluency of the generated response.
  • both of these settings use three evaluation metrics: 1) inform rate, which measures the fraction of dialogues in which the system has provided the correct entity; 2) success rate, the fraction of dialogues in which the system has answered all the requested information; and 3) BLEU, which measures the fluency of the generated response.
  • the combined score, (Inform + Success) × 0.5 + BLEU, is also used. All CASPI numbers reported are the median of 5 runs with different seeds.
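  • as a small worked example of the combined score, assuming inform and success rates are percentages and BLEU is on a 0-100 scale:

```python
def combined_score(inform, success, bleu):
    """Combined score = (Inform + Success) * 0.5 + BLEU."""
    return (inform + success) * 0.5 + bleu

# Hypothetical metric values for one run.
print(combined_score(inform=94.0, success=86.0, bleu=17.5))  # 107.5
```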
  • the causal aware safe policy improvement is compared against existing methods on context-to-response generation task of Multiwoz2.0 in FIG. 8.
  • the existing methods include:
  • DAMD, introduced by Zhang et al., is a domain-aware multi-decoder network. The method also exploits the stochastic nature of the dialogue act by using a data-augmentation technique called multi-action data augmentation. DAMD with data augmentation is denoted here as DAMD + multiaction.
  • HDSA (Semantically conditioned dialog response generation via hierarchical disentangled self-attention, arXiv preprint arXiv:1905.12866, 2019) proposes to use a hierarchical graph representation for the dialogue act. It uses a pre-trained 12-layer BERT model to represent the dialogue act. The predicted dialogue act is transformed to the hierarchical graph structure using a disentangled self-attention model, a 3-layer self-attention model.
  • MinTL-BART (Lin et al.), introduced Levenshtein belief spans framework that predicts only the incremental change in dialogue state per turn. It leverages the pretrained T5 and BART as backbone for model architecture.
  • HDNO, proposed by Wang et al. (Modelling hierarchical structure between dialogue policy and natural language generator with option framework for task-oriented dialogue system, arXiv preprint arXiv:2006.06814, 2020), is a dialogue policy learning method to solve the context-to-response generation task of Multiwoz2.0 (Budzianowski et al., 2018b). It exploits the hierarchical nature of the dialogue act and response generation task by proposing an option-based framework of hierarchical RL and a variational model to learn a latent dialogue act that corresponds to the natural language response.
  • HDNO uses a Markov language model as a proxy reward function.
  • the language model is learnt independent of the metric function.
  • CASPI refrains from reward shaping and is independent of the nature of any underspecified metric function.
  • CASPI is first compared against the current state-of-the-art methods on the context-to-response generation task defined by MultiWoz2.0. The results are tabulated in FIG. 8.
  • the CASPI adaptation of DAMD, CASPI(DAMD), is used for this task.
  • CASPI(DAMD) performs better than other methods on three of the four performance criteria, i.e., success rate, inform rate and combined score.
  • HDSA has a better BLEU score. This rich expressiveness of natural language by HDSA stems from its use of a large 12-layer BERT model.
  • FIG. 12 shows an example of generated responses by different ToD models, such as MinTL, CASPI (MinTL), DAMD and Simple TOD.
  • FIG. 13 shows the human evaluation on criteria appropriateness and fluency. The mean and variance of the score is shown.
  • the Appropriateness scores of MinTL 1301, SimpleTOD 1302 and DAMD 1304 are compared against the CASPI (MinTL) appropriateness 1303.
  • the fluency scores of MinTL 1311, SimpleTOD 1312 and DAMD 1314 are compared against the CASPI (MinTL) fluency 1313.
  • CASPI(MinTL) 1303 outperforms all other models 1301, 1302 and 1304 in appropriateness score, while the fluency scores of CASPI(MinTL) 1313, MinTL 1311 and SimpleTOD 1312 are comparable to each other.
  • FIG. 14 shows the performance when training with only 5% of the data.
  • with 5% of the data, CASPI(MinTL)'s human appropriateness score is increased from 1401 to 1402, now comparable to DAMD trained on 100% of the data. The fluency score also increased from 1411 to 1412. This goes to show the versatility of the pairwise causal reward learning. With enough richness of the neural network used, the pairwise causal reward learning can generalize to unknown dialogue evaluation criteria.
  • FIG. 15 shows the same conversation between a tourist and an information center agent that is shown in FIG. 1B, with example reward values R(st, at, g) that pairwise causal reward learning has predicted for each turn.
  • Turn #3 has received the highest reward; retrospectively, we realize that this is the turn in which the transaction happens, which is the crucial and risk-averse turn in a dialogue, and which is captured by the success rate of the automatic evaluation metric.
  • Turn #2 gets the next best reward, as it captures crucial information needed for the transaction to happen in Turn #3.
  • Turn #4 gets a reward an order of magnitude lower than Turns #3 and #2 because, other than being a nicety, it does not contribute much to the success of the conversation. It should be noted that a turn like Turn #4 typically appears in almost all conversations, and in supervised learning it would receive the highest share of the gradient.
  • the learnt reward redistributes the gradient budget in a way that is aligned with the success of the dialogue objective.
  • FIG. 16 shows different types of behavior CASPI agents sometimes exhibit, especially when trained in a low-sample regime.
  • Greedy agent: In certain domains, the agent has a tendency to book a service before it has gathered all the required information, or before the user has requested or agreed to booking a service. The first example in FIG. 16 demonstrates this behavior. Here the user has requested a taxi; before enough information, such as destination or time of departure, is gathered, the agent books the taxi. This happens because there are gaps in the automatic evaluation metrics. A low BLEU score and relatively high inform and success rates might indicate greedy agent behaviour. Other reasons for a low BLEU score include lack of diversity in the responses or malformed responses.
  • computing devices such as computing device 100 may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 200.
  • Some common forms of machine readable media that may include the processes of method 200 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.

Abstract

Embodiments described herein provide safe policy improvement (SPI) in a batch reinforcement learning framework for a task-oriented dialogue. Specifically, a batch reinforcement learning framework for dialogue policy learning is provided, which improves the performance of the dialogue and learns to shape a reward that reasons about the intention behind the human response rather than just imitating the human demonstration.

Description

SYSTEMS AND METHODS FOR SAFE POLICY IMPROVEMENT FOR TASK ORIENTED
DIALOGUES
Inventors: Govardana Sachithanandam Ramachandran, Kazuma Hashimoto, Caiming Xiong and Richard Socher
CROSS-REFERENCES
[0001] The present disclosure claims priority to U.S. nonprovisional application no. 17/500,855, filed October 13, 2021, which is a non-provisional of and claims priority to U.S. provisional application no. 63/148,861, filed on February 12, 2021 and is also a continuation-in-part of and claims priority to co-pending and commonly-owned U.S. nonprovisional application no. 17/105,262, filed November 25, 2020, which is a nonprovisional application of and claims priority under 35 U.S.C. 119 to U.S. provisional application no. 63/034,653, filed on June 4, 2020.
[0002] All of the aforementioned applications are hereby expressly incorporated by reference herein in their entirety.
TECHNICAL FIELD
[0003] The present disclosure relates generally to machine learning models and neural networks, and more specifically, to safe policy improvement for task-oriented dialogues.
BACKGROUND
[0004] Neural networks have been used to generate conversational responses and thus conduct a dialogue with a human user to fulfill a task. For example, a human user can engage in a conversation with an intelligent assistant to book travel tickets, make restaurant reservations, and/or the like. To fulfill a complex task, the intelligent assistant usually needs to learn to collectively complete multiple subtasks. For example, the assistant needs to reserve a hotel and book a flight so that enough time is left for the commute between arrival and hotel check-in. For the intelligent assistant to learn such complex tasks, the intelligent assistant learns a dialogue policy to select among subtasks or options at a given time, which is often accompanied by a state tracker that tracks the status of the subtask.
[0005] Task-oriented dialogue systems are usually learnt from offline data collected using human demonstrations (e.g., past dialogues, etc.), but collecting diverse demonstrations and annotating them can be expensive. In addition, such offline task-oriented dialogue systems often involve disparate systems, such as a belief states tracker, dialogue policy management, response generation, etc. These disparate systems may induce stochasticity and its associated challenges in addition to the need for sample efficiency in effective dialogue policy learning.
[0006] Therefore, there is a need for efficient policy learning in task-oriented dialogue systems.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] FIG. 1A provides a diagram illustrating an example task-oriented dialogue described by a Markov Decision Process upon which safe policy improvement may be applied, according to one embodiment described herein.
[0008] FIG. IB provides a diagram illustrating an example task-oriented dialogue of multiple dialogue turns between the user and the agent shown in FIG. 1A, according to one embodiment described herein.
[0009] FIG. 2 provides a simplified diagram illustrating an example architecture of training a policy network with reward learning, according to one embodiment described herein.
[0010] FIG. 3A provides a simplified diagram illustrating aspects of the workflow inside the reward learning module shown in FIG. 2, according to one embodiment described herein.
[0011] FIG. 3B provides a simplified diagram illustrating a network architecture for the reward learning module shown in FIG. 2, according to one embodiment described herein. [0012] FIG. 4 is a simplified diagram of a computing device for implementing the safe policy improvement and reward learning for task-oriented dialogue, according to some embodiments.
[0013] FIGS. 5A-5B provide an example logic flow diagram illustrating a method of MDP- based safe policy improvement, according to an embodiment.
[0014] FIG. 6A provides an example pseudo-code segment illustrating an algorithm for causal aware safe policy improvement (CASPI), according to an embodiment described herein.
[0015] FIGS. 6B-6C provide an example logic flow diagram illustrating a method for the CASPI algorithm shown in FIG. 6A, according to an embodiment described herein.
[0016] FIG. 7 is a simplified block diagram illustrating a mixed human-in-the-loop and automatic evaluation metric scores for pairwise reward learning, according to embodiments described herein.
[0017] FIGS. 8-16 provide data charts showing example performance comparison of the safe policy improvement with existing methods, according to one embodiment.
[0018] In the figures and appendix, elements having the same designations have the same or similar functions.
DETAILED DESCRIPTION
[0019] Task-oriented dialogue systems are usually learnt from offline data collected using human demonstrations (e.g., past dialogues, etc.), but collecting diverse demonstrations and annotating them can be expensive. In addition, such offline task-oriented dialogue systems often involve disparate systems, such as a belief states tracker, dialogue policy management, response generation, etc. These disparate systems may induce stochasticity and its associated challenges in addition to the need for sample efficiency in effective dialogue policy learning. [0020] Some existing systems adopt off-policy based reinforcement learning (Batch-RL) methods in solving complex tasks. Batch-RL methods usually use historically annotated data instead of a simulator, which may be sample efficient because inexpensive simulators are usually not readily available to sample data on-policy. These techniques, however, may not perform as efficiently due to the nature of dialogue policy learning. For example, off-policy based learning may often require an estimation of the behavior policy for a given state, e.g., a belief state, of the underlying Markov Decision Process (MDP). In real life, a belief state does not always capture the true state of the MDP, while the MDP latent state such as prosody, among others, may induce stochasticity in the agent response at each turn. In addition, semantic information may be lost when a dialogue act is converted to natural language text. The use of mere policy imitation for the dialogue act may be insufficient to reason fairly about a particular outcome if each constituent of a composite action is weighted equally.
[0021] In view of the need for efficient policy learning in task-oriented dialogue systems, embodiments described herein provide safe policy improvement in a batch reinforcement learning framework for a task-oriented dialogue. Specifically, a dialogue policy is trained on the dialogue rollouts generated by a latent behavior policy with a performance guarantee, e.g., by enforcing that the performance of the new policy is superior to that of the old behavior policy by at least a positive margin. A training loss objective is then defined as the negative of the expected discounted sum of future reward, subject to a condition that the KL divergence between the old behavior policy and the target policy is no greater than a pre-defined hyper-parameter. In this way, the bias in training over rollouts of another policy may be much reduced, thus resulting in "safe" policy improvement.
[0022] In addition, pairwise causal reward learning is provided to shape a reward that reasons the intention of human utterance instead of mimicking a human demonstration in a batch reinforcement setting. A combination of the safe policy improvement and the pairwise causal reward learning may achieve sample efficiency in learning complex tasks. [0023] As used herein, the term "network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
[0024] As used herein, the term "module” may comprise hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
[0025] FIG. 1A provides a diagram 100 illustrating an example task-oriented dialogue described by a Markov Decision Process upon which safe policy improvement may be applied, according to one embodiment described herein. Diagram 100 shows a dialogue turn of a task-oriented dialogue between a user 110 and an intelligent agent 120. For example, the user 110 may provide a user utterance 101 "Book me a flight to London," and the agent 120 may respond with a system response "when do you want to leave?" 102. The dialogue between the user 110 and the intelligent agent 120 may form a task-oriented dialogue to complete the task of planning a trip to London.
[0026] The task-oriented dialogue may be modeled as a Markov Decision Process (MDP), shown by the connected graph structure 110. The MDP is described by the tuple {S, A, P, R, γ} of states S, actions A, transition probability P, reward R, and a discount factor γ. The states S are dialogue contexts that are the agent's interpretation of the environment.
Actions A are possible communication behaviors that are available to the agent at each state. Transition probability P defines the probability that the states S transition to another set of states S' given the actions A. For example, the intelligent agent 120 at time step t with state s_t may perform a composite action a_t as per a target policy π_θ(a_t|s_t) on the environment, with transition probabilities to the next state P(S'|S, A). For example, in the state 105 s_1 after user utterance 101, the origin city is confirmed (per user location), the destination city "London" is obtained from the user utterance 101, but the departure date and departure time are unknown. Thus, a dialogue act 106 may be performed according to the target policy π_θ(a_1|s_1) to request information on the departure date, with the agent 120 replying to user 110 with the system response 102. After the dialogue act 106, the dialogue state transitions from state s_1 to s_2.
[0028] A latent reward function R(a, s) with a discount factor γ is associated with the MDP, defining a reward value given the set of states and actions. For example, a positive reward r 115 of "20" is assigned given the state s_1 and dialogue act a_1. In one embodiment, the latent reward function R(a, s) and the discount factor γ may be predefined for the MDP. In another embodiment, the latent reward function R(a, s) and the discount factor γ may be learnt through the pairwise causal reward learning mechanism described in relation to FIG. 3.
[0029] In one embodiment, given the reward function and the discount factor, the objective is to optimize for the target policy π_θ(a_t|s_t), which maximizes the expected discounted sum of future reward on the MDP, which may be written as the state-action value function Q^{π_θ}(s_t, a_t) = E[ Σ_{t'≥t} γ^{t'−t} r_{t'} ], where r_{t'} is the future reward at future time t', which can be similarly defined with the reward function R(a, s). To achieve this objective, a "safe" policy improvement mechanism is described in relation to FIGS. 2 and 5.
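As an illustration of the discounted-return computation used throughout this disclosure, the following is a minimal Python sketch; the rollout format (a list of per-turn rewards) and the discount value are illustrative assumptions rather than the exact data structures of the embodiments:

```python
# Minimal sketch: Monte Carlo estimate of the discounted sum of future reward
# G_t = sum_{t' >= t} gamma^(t'-t) * r_t' for every turn of one offline rollout.
from typing import List

def discounted_returns(rewards: List[float], gamma: float = 0.95) -> List[float]:
    """Compute the discounted return starting at each turn of a rollout."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example with hypothetical per-turn rewards of a single dialogue rollout.
print(discounted_returns([0.0, 0.0, 20.0, 1.0]))
```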
[0030] FIG. 1B provides a diagram illustrating an example task-oriented dialogue of multiple dialogue turns between the user and the agent shown in FIG. 1A, according to one embodiment described herein. The dialogue shown in FIG. 1B corresponds to a goal 122, e.g., relating to booking a train that departs at a certain time for a certain destination. The dialogue includes 4 dialogue turns, each of which includes a delexicalized user utterance 125a-d, an agent dialogue act 126a-d, and a delexicalized agent utterance/response 127a-d, respectively. The 4 dialogue turns show that mere policy imitation for the dialogue act may fall short of reasoning about the outcome, and instead focuses on each constituent of a composite action equally. For example, consider Turns 3 and 4: Turn 3 is rich in semantic information and provides the key to the transaction of the booking process, while Turn 4, which is of the least use to the success of the conversation, gets an equal weight as the semantically rich turns. Such specifics are lost in imitation policy learning.
[0031] FIG. 2 provides a simplified diagram 200 illustrating an example architecture of training a policy network with reward learning, according to one embodiment described herein. Diagram 200 shows that a training dataset 210 is input to a policy network 220 and a reward learning module 260. Specifically, the dataset 210 includes a plurality of rollouts 212a-n from dialogues. The rollouts 212a-n may be generated by human agents performing actions based on a latent stochastic behavior policy.
[0032] For example, in offline Batch-RL, the intelligent agent does not get to interact with the environment. Instead, a set of offline data D 210 logged by human agents performing actions based on a latent stochastic behavior policy π_b can be obtained. The set of offline data D 210 includes a plurality of rollouts 212a-n of a dialogue, each denoted by τ_i. Each rollout τ_i = {(o_t, a_t)}_{t=0..T}, where each o_t is the observation at turn t, composed of (b_t, u^u_t, u^a_{t−1}). Here b_t is the belief state of the agent at turn t, and u^u_t and u^a_{t−1} are the user and agent utterances at time t and t − 1, respectively. Thus, batch-RL entails training a policy on rollouts generated by the latent behaviour policy.
[0033] However, directly optimizing a training objective, e.g., the discounted sum of future reward, on the rollouts of another policy leads to a large bias in the value function estimation, poor generalization characteristics, and sample inefficiency. Thus, a "safe" policy improvement may be implemented, such that the new policy performance is bounded compared to the old policy. Specifically, the value function of the new target policy π_θ and the value function of the latent behavior policy π_b satisfy: Pr(V^{π_θ} ≥ V^{π_b} − ζ) ≥ 1 − δ, where V^{π_θ} and V^{π_b} are the value functions of the target policy and the behavior policy, respectively. Here 1 − δ and ζ are the high-probability and approximation meta-parameters, respectively.
[0034] Thus, based on the input observations from the dataset 210, the policy network 220 may generate a target act distribution π_θ(a_t|s_t) according to a target policy and the parameters θ of the policy network. Then, a stochastic loss objective L_sto(θ) may be computed at loss module 230 for the safe policy improvement:

L_sto(θ) = −E_{(s_t, a_t)∼π_b}[ Q^{π_b}(s_t, a_t) log π_θ(a_t|s_t) ],  subject to  D_KL(π_b(·|s_t) ‖ π_θ(·|s_t)) ≤ η.   (1)

[0035] In some implementations, the stochastic loss objective L_sto(θ) may be computed using the belief state b_t to replace s_t in Eq. (1). The belief state b_t is a stochastic variable, as it does not capture all information of the state. The policy π_θ(a_t|b_t; θ) is computed for optimizing the stochastic loss function.
[0036] Traditionally, the update mechanism provided in Schulman et al., Trust Region Policy Optimization, in Proceedings of International Conference on Machine Learning, pp. 1889-1897, 2015, provides bounded errors as long as the constraint of (1) is met, where D_KL(·‖·) is the KL divergence and η is a hyper-parameter. However, the Schulman update rule requires access to the behavior policy π_b(a_t|s_t), which is intractable to estimate. Instead, the behaviour policy conditioned on the belief state b_t, i.e., π_b(a_t|b_t), may be estimated as against s_t in (1), which results in a stochastic behavior policy. The belief state b_t is part of the observation o_t at turn t that can be obtained from a specific rollout in the dataset D 210. Thus, in one implementation, when computing the stochastic loss objective in (1), π_b(a_t|s_t) may be approximated by π_b(a_t|b_t), which can be obtained from the rollouts in the dataset 210. For example, the estimate of π_b(a_t|b_t) may be given by the number of occurrences of the dialogue act a_t given b_t divided by the total number of acts given b_t.

[0037] Based on availability of more evidence of the observation o_t (which contains more information than the belief state b_t), the mode of the policy may collapse to a near deterministic action. To factor this into the policy learning, an additional deterministic loss may be computed at loss module 240:

L_det(θ) = −E_{(o_t, a_t)∼D}[ G(τ, g, t) log π_θ(a_t|o_t) ],   (2)

where G(τ, g, t) = Σ_{t'=t..T} γ^{t'−t}(θ_2) R_{θ_1}(g, a_{t'}, s_{t'}) is the discounted sum of future reward for a single rollout τ with goal g from time step t; the discount factor γ(θ_2) is a function of parameter θ_2; and R_{θ_1}(g, a, s) is the reward function of the states, actions and the goal, given parameter θ_1. The reward R_{θ_1} and the discount factor γ(θ_2) are learnt by the reward learning module 260. Hence the combined loss module 250 computes the policy optimization loss function as:

L(θ) = L_sto(θ) + L_det(θ).   (3)
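As an illustration of the count-based estimate of the behavior policy π_b(a_t|b_t) described in paragraph [0036] above, the following is a minimal sketch; the representation of belief states and dialogue acts as hashable keys is an assumption for illustration only:

```python
# Minimal sketch: estimate pi_b(a_t | b_t) as the number of occurrences of a
# dialogue act for a given belief state divided by the total number of acts
# observed for that belief state, counted over the offline rollouts.
from collections import Counter, defaultdict

def estimate_behavior_policy(rollouts):
    """rollouts: iterable of dialogues, each a list of (belief_state, dialogue_act) pairs."""
    counts = defaultdict(Counter)
    for rollout in rollouts:
        for belief_state, act in rollout:
            counts[belief_state][act] += 1
    policy = {}
    for belief_state, act_counts in counts.items():
        total = sum(act_counts.values())
        policy[belief_state] = {act: n / total for act, n in act_counts.items()}
    return policy

# Toy usage with hypothetical belief-state keys and act labels.
pi_b = estimate_behavior_policy([[("b1", "request_date"), ("b1", "request_date"), ("b1", "book")]])
print(pi_b["b1"])  # {'request_date': 0.666..., 'book': 0.333...}
```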
[0038] In one embodiment, the network 220 may be trained using just the stochastic loss L_sto(θ), or just the deterministic loss L_det(θ). Alternatively, the network 220 is trained by the sum L(θ) of the two losses as described below.
[0039] In one embodiment, the combined loss module 250 may achieve the loss function (3) via two forward passes on the policy network 220. For example, in the first pass, only the belief states {b_t} from the dataset 210 are input to the policy network 220, such that the first pass captures the stochasticity of the policy conditioned only on the belief state {b_t}. During the first pass, the stochastic loss module 230 computes the stochastic loss in (1) using the action distribution π_θ(a_t|b_t) output from the policy network 220. In the second pass, all the observation information o_t = (b_t, u^u_t, u^a_{t−1}) from the dataset 210 is input to the policy network 220 to get the action distribution π_θ(a_t|o_t) for the deterministic loss module 240 to compute the deterministic loss in (2). The second pass collapses the mode given other latent information of the state, such as u^u and u^a. After the two passes, the combined loss module 250 computes the loss objective in (3), which may be used to update the policy network 220 via backpropagation. Further details of the workflow for implementing the safe policy improvement with policy network 220 can be found in relation to FIGS. 5A-5B.
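The two-pass loss computation may be sketched as follows; the policy_net interface, tensor shapes, and the precomputed Q-values and discounted returns are assumptions for illustration, and the KL constraint of Eq. (1) is not shown:

```python
# Hedged sketch of the combined loss of Eq. (3): the first pass conditions the
# policy only on belief states (stochastic loss of Eq. (1)); the second pass
# conditions on the full observation (deterministic loss of Eq. (2)).
import torch

def combined_loss(policy_net, belief_states, observations, actions,
                  q_behavior, discounted_returns):
    # First pass: action distribution conditioned on the belief state only.
    log_pi_b = policy_net(belief_states).log_softmax(dim=-1)
    log_pi_b_a = log_pi_b.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    loss_sto = -(q_behavior * log_pi_b_a).mean()          # Eq. (1) surrogate

    # Second pass: action distribution conditioned on the full observation.
    log_pi_o = policy_net(observations).log_softmax(dim=-1)
    log_pi_o_a = log_pi_o.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    loss_det = -(discounted_returns * log_pi_o_a).mean()  # Eq. (2) surrogate

    return loss_sto + loss_det                            # Eq. (3)
```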
[0040] As shown above, the stochastic loss objective (1) for safe policy improvement requires the Q-function of the latent behaviour policy, which can be estimated using Monte Carlo sampling on the dataset D, given that the reward R(s, a, g) is known. The reward learning module 260 provides a mechanism to learn a reward that is causally reasoned on the intention of the human demonstrator. The reward learning module 260 provides the reward function R(s, a, g) and the discount parameter γ to the stochastic loss module 230 and the deterministic loss module 240. Further details of the reward learning module 260 are described below in relation to FIG. 3.
[0041] FIG. 3A provides a simplified diagram illustrating aspects of the workflow inside the reward learning module 260 shown in FIG. 2, according to one embodiment. Specifically, dialogue policy learning is usually accompanied by a metric M, to evaluate the performance of the learnt policy. Though these metrics could serve as a proxy for a reward function, directly combining them into learning the reward can be challenging. For example, these metric functions usually return a metric score for the entire dialogue. Given the complex state-action space of the dialogue management system, the scores at the dialogue level are under-specified for rewarding an action performed at each dialogue turn.
[0042] To address this under-specified feedback, preference learning may be adapted from an online setting to an offline setting. For example, preference learning was originally proposed in Paul et al., Feature selection as causal inference: Experiments with text classification, in Proceedings of the 21st Conference on Computational Natural Language Learning, pages 163-172, 2017. The reward can be parametrized for every timestep t as r(o_t, a_t, g). Given a pair of rollouts τ_1, τ_2 ∈ D with actions for each state in the rollouts sampled from the learnt policies π_1 and π_2, respectively, let P[τ_1 ≻ τ_2] be the probabilistic measure that captures the preference of π_1 over π_2. This preference is true when the sums of rewards of the two dialogue rollouts satisfy:

Σ_{(o_t, a_t)∈τ_1} r(o_t, a_t, g_1) > Σ_{(o_t, a_t)∈τ_2} r(o_t, a_t, g_2).

[0043] As further described in relation to FIG. 3B, P[τ_1 ≻ τ_2] may be defined as the preferential probability represented by:

P[τ_1 ≻ τ_2] = φ(Σ_{(o_t, a_t)∈τ_1} r(o_t, a_t, g_1)) / [ φ(Σ_{(o_t, a_t)∈τ_1} r(o_t, a_t, g_1)) + φ(Σ_{(o_t, a_t)∈τ_2} r(o_t, a_t, g_2)) ].

Here φ(·) could either be exp(·) or the identity 1(·). For example, the probability may be computed with the reward and discount parametrized by hyper-parameters θ_1 and θ_2, i.e., R(a, s, g; θ_1) and γ(θ_2).
[0044] Thus, the reward R may be optimized by minimizing a binary cross-entropy loss between the preference probability P[τ_1 ≻ τ_2] and the normalized metric score m̄(τ) between a pair of rollouts. For example, the normalized metric score is computed based on a first metric score of a first dialogue τ_1 from the pair and a second metric score of a second dialogue τ_2 from the pair, and both the first metric score and the second metric score are generated by the same score function M(·). In this way, the network (with the reward) is trained to generate dialogues with performance metrics that can closely reflect the preference between a rollout pair. The loss objective for pairwise reward learning can be computed by:

L(θ_1, θ_2) = −E_{τ_1, τ_2∼D}[ m̄(τ_1) log P[τ_1 ≻ τ_2] + m̄(τ_2) log P[τ_2 ≻ τ_1] ],   (4)

where m̄(τ_1) = M(τ_1) / (M(τ_1) + M(τ_2)).
[0045] Here θ_1 and θ_2 correspond to the parameters for the reward R(a, s, g; θ_1) and the discount factor γ(θ_2), respectively. Specifically, the discount factor γ may be pre-defined, or learnt during training.
[0046] Thus, the reward learning module 260 receives and splits the dataset D into K-fold training and validation subsets 261. For example, the dataset 210 is partitioned into complementary subsets 261, performing training on one subset and validating the trained network on another (test) subset. At every epoch of training, K baseline models 262a-n are trained based on a cross-entropy loss (instead of (3)) using the K training subsets. The trained K baseline models 262a-n are used to predict on the corresponding validation subsets, and each baseline model may be similar to the neural model used by the policy network 220. The predicted action distributions from the K baseline models are used to generate output dialogues 264a-n, each of which is scored by a chosen metric 263. Thus, a pair of dialogues from the predicted dialogues 264a-n with corresponding metric scores may be used to compute the pairwise reward loss (4) at the pairwise causal reward learning module 265. The pairwise reward loss (4) may then be used to backpropagate a neural network to update the parameters θ_1, θ_2. In this way, the pairwise causal reward learning module 265 outputs the reward function R(a, s, g; θ_1) and the discount factor γ(θ_2). For example, the neural network for the pairwise causal reward learning module 265 may be a single bi-LSTM layer that embeds action, state and goal, followed by a couple of multilayer perceptron (MLP) layers.
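For illustration, a minimal sketch of the pairwise reward loss of Eq. (4) for one sampled pair of rollouts is given below; the reward_net interface (mapping a rollout to a scalar dialogue reward), the choice of φ = exp, and the scalar metric inputs are assumptions rather than the exact implementation:

```python
# Hedged sketch of the pairwise causal reward objective: a binary cross-entropy
# between the preference probability of a rollout pair and their normalized
# metric scores (Eq. (4)).
import torch

def pairwise_reward_loss(reward_net, rollout_1, rollout_2, metric_1, metric_2, eps=1e-8):
    r1 = reward_net(rollout_1)                              # global reward R(tau_1)
    r2 = reward_net(rollout_2)                              # global reward R(tau_2)
    pref = torch.exp(r1) / (torch.exp(r1) + torch.exp(r2))  # P[tau_1 > tau_2], phi = exp
    m_bar = metric_1 / (metric_1 + metric_2 + eps)          # normalized metric score
    # Binary cross-entropy between normalized metric score and preference probability.
    return -(m_bar * torch.log(pref + eps) + (1.0 - m_bar) * torch.log(1.0 - pref + eps))
```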
[0047] In another embodiment, let the learnt reward R(s, a, g; θ_1) serve as a per-sample weight on the policy gradient; then the parameter θ can be updated by:

θ ← θ + α E_{(s, a)∼D}[ R(s, a, g) ∇_θ log π_θ(a|s) ],   (6)

where α is a learning rate.
[0048] The learnt reward is akin to sample weights for each instance of the data, which helps to redistribute the gradient update budget among the samples based on their contribution to the overall success of the task-oriented dialogue (ToD) system. To this end, the learnt reward may be used as a sample weight in any existing ToD system to reap the benefit of sample efficiency it brings.
[0049] In one embodiment, the dialogue roll-outs are generated by an expert latent policy. The data (dialogue rollouts) may be distributed as per the optimal latent policy and transition probability. The process of learning a policy that maximizes the likelihood of the data may serve as a curriculum for exploring the state-action space for the pairwise reward learning objective (5). The process of fitting a maximum likelihood (MLE) policy may induce useful perturbations through the stochasticity of the optimizer. After the output dialogues 264a-n are scored by a chosen metric 263, on convergence of the MLE process, the pairs of learnt roll-outs with the corresponding metric scores may be used to train the preferential optimization (5), which in turn learns the fine-grained reward R(a, s, g; θ_1).
[0050] FIG. 3B provides a simplified diagram illustrating a network architecture 300 for the reward learning module 260 shown in FIG. 2, according to one embodiment described herein. In one embodiment, three single bi-LSTM layers are each used to encode the goal, belief state, and dialogue act or response sequences at each dialogue turn on each of the sampled roll-out pairs. For example, the bi-LSTM layer 301a is used to encode the goal of the sampled predicted rollout τ_1; the bi-LSTM layer 302a is used to encode the belief state of each dialogue turn of rollout τ_1; and the bi-LSTM layer 303a is used to encode the dialogue act of each dialogue turn of rollout τ_1. Similarly, the bi-LSTM layer 301b is used to encode the goal of the sampled predicted rollout τ_2; the bi-LSTM layer 302b is used to encode the belief state of each dialogue turn of rollout τ_2; and the bi-LSTM layer 303b is used to encode the dialogue act of each dialogue turn of rollout τ_2.
[0051] In one embodiment, the same three bi-LSTM layers can be used to encode both the rollouts τ_1 and τ_2. In another embodiment, two sets of parallel bi-LSTM layers 301a, 302a, and 303a, and 301b, 302b and 303b may be used to encode the pair of sampled rollouts, respectively, in parallel.
[0052] The three encoded representations from bi-LSTM layers 301a, 302a, and 303a are concatenated at 305a. Similarly, the three encoded representations from bi-LSTM layers 301b, 302b, and 303b are concatenated at 305b.
[0053] The concatenated representation is then fed through a couple of feed-forward layers before making a bounded reward prediction r(o_t, a_t, g_1) or r(o_t, a_t, g_2) for each turn of the rollout τ_1 or τ_2 using a sigmoid function. The per-turn rewards are summed over all turns of each rollout to form a global reward R(τ_1) or R(τ_2) for the pair of rollouts.
[0054] Using the pair of dialogue rewards R(τ_1) and R(τ_2), the probabilistic preference between the rollouts can be computed either by standard normalization or a softmax function, e.g.,

P[τ_1 ≻ τ_2] = φ(R(τ_1)) / (φ(R(τ_1)) + φ(R(τ_2))),

where the φ(·) function may be standard normalization or a softmax (exponential). The output 307 of this preference probability may be optimized using the cross-entropy loss described in Eqn. (4).
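The reward network of FIG. 3B may be sketched as follows; the embedding size, hidden size, token-id input format, and head depth are illustrative assumptions, not the exact architecture of the embodiments:

```python
# Hedged sketch of the FIG. 3B reward network: three bi-LSTM encoders (goal,
# belief state, dialogue act/response) are concatenated and fed through
# feed-forward layers to a sigmoid-bounded per-turn reward, summed over turns.
import torch
import torch.nn as nn

class PairwiseRewardNet(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.goal_enc = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.belief_enc = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.act_enc = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(
            nn.Linear(6 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())        # bounded per-turn reward

    def _encode(self, enc, tokens):
        _, (h, _) = enc(self.embed(tokens))            # h: (2, turns, hidden)
        return torch.cat([h[0], h[1]], dim=-1)         # (turns, 2 * hidden)

    def forward(self, goal, beliefs, acts):
        # goal / beliefs / acts: (turns, seq_len) token ids for one rollout.
        feats = torch.cat([self._encode(self.goal_enc, goal),
                           self._encode(self.belief_enc, beliefs),
                           self._encode(self.act_enc, acts)], dim=-1)
        per_turn = self.head(feats).squeeze(-1)        # r(o_t, a_t, g) per turn
        return per_turn.sum()                          # global reward R(tau)
```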
[0055] Figure 4 is a simplified diagram of a computing device for implementing the safe policy improvement and reward learning for task-oriented dialogue, according to some embodiments. As shown in Figure 4, computing device 400 includes a processor 410 coupled to memory 420. Operation of computing device 400 is controlled by processor 410. And although computing device 400 is shown with only one processor 410, it is understood that processor 410 may be representative of one or more central processing units, multicore processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 400. Computing device 400 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
[0056] Memory 420 may be used to store software executed by computing device 400 and/or one or more data structures used during operation of computing device 400. Memory 420 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0057] Processor 410 and/or memory 420 may be arranged in any suitable physical arrangement. In some embodiments, processor 410 and/or memory 420 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 410 and/or memory 420 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 410 and/or memory 420 may be located in one or more data centers and/or cloud computing facilities.
[0058] In some examples, memory 420 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 410) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 420 includes instructions for a safe policy improvement module 430 and a reward learning module 435 that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, the safe policy improvement module 430 and the reward learning module 435 receives an input 440 via a data interface 415 and may generate an output 450.
[0059] For example, the input 440 may include a training dataset 210 as shown in FIGS. 2-3. The data interface 415 may include a communication interface that receives the dataset input 440 from a remote database via a communication network. In another example, the data interface 415 may include a user interface via which a user may select and load the dataset input 440 to the processor 410. The output 450 may include an action distribution for a dialogue, an optimized policy, and/or the like.
[0060] The safe policy improvement module 430 may comprise a policy network 220, a stochastic loss module 230, a deterministic loss module 240, and a combined loss module 250 shown in FIG. 2. The reward learning module 435 may be similar to module 260 shown in FIG. 2, which is further detailed in FIG. 3. The reward learning module 435, as described in relation to FIG. 3, may comprise K baseline models 262a-n and a pairwise causal reward learning module 265.
[0061] FIGS. 5A-5B provide an example logic flow diagram illustrating a method 500 of MDP-based safe policy improvement, according to an embodiment. One or more of the processes 502-524 of method 500 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 502-524. In some embodiments, method 500 may correspond to the method used by the module 430.
[0062] At process 502, a training dataset (e.g., dataset 210) comprising a plurality of dialogue rollouts (e.g., rollouts 212a-n) generated by a latent stochastic behavior policy is received. Each rollout includes a time series of observations representing information of a respective dialogue at a plurality of dialogue turns.
[0063] At process 504, only the belief states (e.g., {b_t}) from the observations of the training dataset are input to a neural model (e.g., policy network 220) in a first pass to the neural model. [0064] At process 506, a first predicted action distribution is generated based on a current state of the respective dialogue according to a target policy, e.g., π_θ(a_t|b_t).
[0065] At process 508, a first discounted sum of future reward is computed based on a discount parameter and a reward function of actions and states of the respective dialogue according to the latent behavior policy. Specifically, during the first pass, an action distribution is conditioned on a belief state according to the latent stochastic behavior policy, and the belief state is obtained from the time series of observations.
[0066] At process 510, a first loss objective is computed based on a first expectation of the first discounted sum of future reward and the first predicted action distribution. Specifically, the first expectation is taken over a probability distribution of the states and the actions according to the latent stochastic behavior policy, e.g., according to (1).
[0067] At process 512, the full observations are input to the neural model in a second pass. For example, in addition to the belief states, all the observation information o_t = (b_t, u^u_t, u^a_{t−1}) from the dataset 210 is input to the policy network 220.
[0068] At process 514, a second predicted action distribution is generated based on a current observation from the time series of observations according to the target policy. For example, the action distribution π_θ(a_t|o_t) is generated.
[0069] At process 516, a second discounted sum of future reward is computed based on the discount parameter and the reward function for a specific rollout, e.g., G(τ, g, t) = Σ_{t'=t..T} γ^{t'−t} R(g, a_{t'}, s_{t'}). Specifically, the second discounted sum of future reward is a collapsed near-deterministic approximation of the first discounted sum of future reward.
[0070] At process 520, a second loss objective is computed based on a second expectation of the second discounted sum of future reward and the second predicted action distribution. Specifically, the second expectation is taken over an average of the observations across the training dataset. For example, the second loss objective is computed by the deterministic loss module 240 according to (2). [0071] At process 522, a combined loss objective is computed by summing the first loss objective and the second loss objective, e.g., according to (3).
[0072] At process 524, the neural model is updated based on the combined loss objective, subject to a condition that a KL-divergence between the latent stochastic behavior policy and the target policy conditioned on the current state of the respective dialogue is less than a pre-defined hyperparameter.
[0073] FIG. 6A provides an example pseudo-code segment illustrating an algorithm for causal-aware safe policy improvement (CASPI), according to an embodiment described herein. The (train) dataset is subsampled into K-fold training sets D_T and validation sets D_V. K baseline models are trained to fit the data distribution generated by experts using a cross-entropy loss. During the process of fitting the data distribution, the still-learning K policies are used to predict on their corresponding K-fold validation subsets at every epoch of the training. Each of the dialogues is scored by the chosen dialogue-level metric during the training. On convergence of the supervised learning process, pairs of dialogue predictions generated by the above process, along with their corresponding metric scores, are used to train the preferential optimization objective of Eqn. (4), which in turn learns the fine-grained reward R(a, s, g; θ). The use of K-fold subsampling and K baseline models helps generate stochasticity in the generated samples. It also helps in effectively using the data and makes the method sample-efficient.
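The control flow of the CASPI algorithm may be sketched as follows; the callables passed in (make_model, train_mle, predict_rollouts, metric, train_pairwise_reward, train_policy_with_reward) are placeholders for the steps described above, not APIs defined by this disclosure:

```python
# Hedged sketch of the CASPI outer loop of FIG. 6A: K-fold subsampling,
# cross-entropy fitting of baseline models, scoring of predicted rollouts,
# pairwise reward learning (Eq. (4)), then policy optimization (Eqs. (1)-(3)).
from sklearn.model_selection import KFold

def caspi(dataset, make_model, train_mle, predict_rollouts, metric,
          train_pairwise_reward, train_policy_with_reward,
          num_folds=5, epochs=10):
    pairwise_set = []                                    # D_P: (rollout, metric score) pairs
    for train_idx, val_idx in KFold(n_splits=num_folds).split(dataset):
        d_train = [dataset[i] for i in train_idx]
        d_val = [dataset[i] for i in val_idx]
        model = make_model()                             # baseline model for this fold
        for _ in range(epochs):
            train_mle(model, d_train)                    # cross-entropy fit to expert data
            rollouts = predict_rollouts(model, d_val)    # predictions on the held-out fold
            pairwise_set += [(r, metric(r)) for r in rollouts]
    reward_fn, gamma = train_pairwise_reward(pairwise_set)       # Eq. (4)
    return train_policy_with_reward(dataset, reward_fn, gamma)   # Eqs. (1)-(3)
```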
[0074] FIGS. 6B-6C provide an example logic flow diagram illustrating a method 600 for the CASPI algorithm shown in FIG. 6A, according to an embodiment described herein. One or more of the processes 602-626 of method 600 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes 602-626. In some embodiments, method 600 may correspond to the method used by the module 430. [0075] At process 602, a training dataset (e.g., dataset 210) comprising a plurality of dialogue rollouts (e.g., rollouts 212a-n) generated by a latent stochastic behavior policy is received.
[0076] At process 604, the training dataset is repeatedly sampled for a number of times to generate a number of training subsets and a number of validation subsets. For example, as described in relation to FIG. 3A, the dataset D is split into K-fold training subsets D_T and validation subsets D_V 261. For example, the dataset 210 is partitioned into complementary subsets 261, performing training on one subset and validating the trained network on another (test) subset.
[0077] At process 606, for each dataset in {D_T, D_V}, a task-oriented dialogue model is trained based on a cross-entropy loss using training data in a first training subset of the number of training subsets. For example, a dataset is retrieved from the number of training subsets or the number of validation subsets {D_T, D_V}, and the task-oriented dialogue model is updated by minimizing a cross-entropy of a predicted dialogue action conditioned on a current state of a dialogue according to a target policy using dialogue data from the retrieved dataset. The cross-entropy loss can be expressed as:

L_CE = −E_{(s, a)∼D_T}[ a log π_m(s) ],

where π_m(s) denotes the predicted dialogue act distribution according to the policy π_m conditioned on the dialogue state s, and a is the ground-truth dialogue act.
[0078] At step 608, for the same respective dataset from step 606, the task-oriented dialogue model generates predicted dialogue rollouts from dialogue data in a first validation subset of the number of validation subsets.
[0079] At step 610, the predicted dialogue rollouts are added to a pairwise causal learning subset D_P. From step 612, steps 608-610 may be repeated if there is another training epoch. If there is no other training epoch at step 612, method 600 may determine whether there is another dataset in {D_T, D_V} at step 616. If there is another dataset, method 600 proceeds to repeat from step 606 with another dataset. If there is no other dataset, method 600 proceeds to step 618.
[0080] At step 618, a pair of dialogue rollouts may be sampled from the pairwise causal learning subset.
[0081] At step 620, the task-oriented dialogue model may be trained based on a binary cross-entropy loss between a preferred probability between the pair of dialogue rollouts and a normalized metric score based on the pair of dialogue rollouts. For example, step 620 may be illustrated by the process flow described in relation to FIG. 3B.
[0082] At step 622, method 600 determines whether training convergence has been reached using data D_P. If not, method 600 repeats from step 618 by re-sampling another pair of dialogue rollouts. If convergence has been reached using data D_P, method 600 proceeds to step 624.
[0083] At step 624, the task-oriented dialogue model may be trained based on a policy optimization loss that optimizes over the target policy using the training dataset. For example, the optimization over policy is discussed in relation to method 500 in FIGS. 5A- 5B.
[0084] At step 626, method 600 determines whether training convergence has been reached using data D. If not, method 600 repeats from step 624. If convergence has been reached using data D, method 600 may end.
[0085] FIG. 7 is a simplified block diagram illustrating mixed human-in-the-loop and automatic evaluation metric scores for pairwise reward learning, according to embodiments described herein. Automatic evaluation metrics have their own biases. The true objective of a ToD system is the human experience while interacting with the dialogue system, which automatic evaluation metrics may fall short of capturing. To this end, human evaluation may be conducted on the quality of the generated responses. Quality can be defined by the following criteria: (a) Appropriateness, e.g., are the generated responses appropriate for the given context in the dialogue turn? (b) Fluency, e.g., are the generated responses coherent and comprehensible?
[0086] Therefore, as shown in FIG. 7, after prediction on the K validation sets by K models at each epoch of training at 710, and the pairwise causal reward learning at 702, a dialogue turn in the test set is randomly picked. The human evaluators were shown context leading up to the turn and gave an evaluation score at 730 of the dialogue turn. The predictions for the turn by different models were anonymized and displayed to the evaluators. For example, the human evaluators were asked to give a score between 1 and 5 for appropriateness and fluency, with score of 5 being best and 1 being the worst. 100 randomly selected dialogue turns were presented to 10 participants.
[0087] The ToD model is then trained for reward R(s, a, g) using pairwise causal reward learning as described in relation to FIGS. 6A-6C, where examples of the mini batch are randomly sampled either from human scored examples 730 or the ones scored by the automatic evaluation metric 740.
[0088] It is noted that embodiments described throughout FIGS. 1A-7 relate to dialogue policy learning. However, similar embodiments can be applied to different tasks in similar settings, such as but not limited to end-to-end dialogue system training (e.g., dialogue state tracking, dialogue policy and response generation, etc.), and/or the like.
Example Performance
[0089] In one embodiment, the training dataset (e.g., 210) can be the MultiWoz2.0 dataset, a multi-turn multi-domain dataset spanning seven domains, including attraction, hospital, hotel, police, taxi, train and an additional domain for general greeting. The dataset is created from real human conversations between a tourist and a clerk at an information center. Each dialogue is generated by users with a defined goal which may cover 1-5 domains with a maximum of 13 turns in a conversation. The dataset has 10438 dialogues split into 8438 dialogues for the training set and 1000 dialogues each for the validation and test sets. [0090] In one embodiment, the policy network 220 and/or the reward learning network 260 may adopt a neural model proposed in Zhang et al., Task-oriented dialog systems that consider multiple appropriate responses under the same context, arXiv preprint arXiv:1911.10484, 2019 as the baseline (referred to as "DAMD"). For the pairwise causal reward learning network 260, a single bi-LSTM layer to embed action, state and goal, followed by a couple of MLP layers, may be used. DAMD is composed of three seq2seq generative models using GRUs, one each for the belief state, dialogue act and response generation modules. An attention layer is then used to attend the outputs of the seq2seq models with the context vector of the previous turn for a copy-over mechanism. The outputs are then used as representations for predicting series of tokens for their respective modules. Both the stochastic loss L_sto and the deterministic loss L_det are used on the dialogue act. For DST and response generation, the cross-entropy loss is used as is from DAMD.
[0091] In one embodiment, the reward learning network 260 may also adopt a model with more complexity, the task-oriented dialogue model MinTL described in Lin et al., Mintl: Minimalist transfer learning for task-oriented dialogue systems, arXiv preprint arXiv:2009.12005, 2020. MinTL uses the large pretrained language model BART, a standard encoder-decoder transformer architecture with a bidirectional encoder and an autoregressive decoder. It is pre-trained on the task of denoising corrupt documents. BART is trained using a cross-entropy loss between the decoder output and the original document. MinTL doesn't explicitly predict the dialogue act. Hence the deterministic loss L_det is used directly on the generated response, and for DST the loss is retained as is from MinTL.
[0092] In one embodiment, database results are represented as one-hot vectors. To reduce surface-level variability in the responses, domain-adaptive delexicalization preprocessing is adopted, and delexicalized responses are generated with placeholders for specific values which can be filled according to the current utterance that refers to some slot values offered by the system in the previous turn.
[0093] In one embodiment, the context-to-response generation task of MultiWoz2.0 may be implemented and the corresponding evaluation metrics are used to measure the quality of the response. Both the context-to-response generation and end-to-end dialogue settings use three evaluation metrics: 1) inform rate, which measures the fraction of dialogues in which the system has provided the correct entity; 2) success rate, which measures the fraction of dialogues in which the system has answered all the requested information; and 3) BLEU, which measures the fluency of the generated response. The combined score (Inform + Success) × 0.5 + BLEU is also used. All the numbers of CASPI reported are the median of 5 runs with different seeds.
[0094] For the metric M used in pairwise causal reward learning, the following metric is used:
M := Inform + Success + λ × BLEU

This is very similar to the combined score used in evaluation, and both are equivalent when λ = 2. The hyperparameter λ is used to normalize the achievable scale of BLEU. The success rate, if used as is, will result in a non-Markovian and stochastic per-turn reward function, since the reward of the current state will depend on the performance of future states. Hence, a soft version of the metric, M_soft, is used, where the success rate measures the fraction of requested information provided in a dialogue. The original metric that uses the discrete variant of the success rate is referred to as M_hard. The choice of action in the reward function R(s_t, a_t, g) can either be the dialogue act or the generated response; the corresponding variants of the metric are referred to as M(act) and M(resp). To demonstrate the versatility of the method to adapt to different metrics, all the discussed variants of the metric are used.
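For illustration, the dialogue-level metric M may be computed as below; the per-dialogue statistics passed in (inform, soft success fraction, BLEU) are assumed to come from the standard MultiWOZ evaluation scripts:

```python
# Hedged sketch of M_soft := Inform + Success_soft + lambda * BLEU for one dialogue,
# with lambda = 2 making it equivalent (up to scale) to the combined score.
def dialogue_metric(inform: float, soft_success: float, bleu: float, lam: float = 2.0) -> float:
    """Dialogue-level metric used as the scoring target in pairwise reward learning."""
    return inform + soft_success + lam * bleu

# Toy usage with hypothetical per-dialogue statistics.
print(dialogue_metric(inform=1.0, soft_success=0.75, bleu=0.18))
```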
[0095] The causal-aware safe policy improvement (CASPI) is compared against existing methods on the context-to-response generation task of MultiWoz2.0 in FIG. 8. The existing methods include:
[0096] DAMD: introduced by Zhang et al., DAMD is a domain-aware multi-decoder network. The method also exploits the stochastic nature of the dialogue act by using a data-augmentation technique called multi-action data augmentation. DAMD with data augmentation is denoted here as DAMD + multiaction.
[0097] HDSA (Chen et al., Semantically conditioned dialog response generation via hierarchical disentangled self-attention, arXiv preprint arXiv:1905.12866, 2019) proposes to use a hierarchical graph representation for the dialogue act. It uses a pre-trained 12-layer BERT model to represent the dialogue act. The predicted dialogue act is transformed to the hierarchical graph structure using a disentangled self-attention model, a 3-layer self-attention model.
[0098] SOLOIST (Peng et al., Soloist: Few-shot task-oriented dialog with a single pretrained auto-regressive model, arXiv preprint arXiv:2005.05298, 2020) is trained on turn-level data without the generated belief state and system act in the dialog history.
[0099] MinTL-BART (Lin et al.) introduced the Levenshtein belief spans framework, which predicts only the incremental change in the dialogue state per turn. It leverages the pretrained T5 and BART as backbones for the model architecture.
[00100] HDNO, proposed by Wang et al., Modelling hierarchical structure between dialogue policy and natural language generator with option framework for task-oriented dialogue system, arXiv preprint arXiv:2006.06814, 2020, is a dialogue policy learning method to solve the context-to-response generation task of MultiWoz2.0 (Budzianowski et al., 2018b). It exploits the hierarchical nature of the dialogue act and response generation task by proposing an option-based framework of hierarchical RL and a variational model to learn a latent dialogue act that corresponds to the natural language response. Unlike CASPI, HDNO, though it highlights the risk of sparsity of a metric function such as success rate as a reward function, resorts to shaping a proxy reward function, using a Markov language model as the proxy reward. The language model is learnt independently of the metric function. CASPI refrains from reward shaping and is independent of the nature of any underspecified metric function.
[00101] CASPI is first compared against the current state-of-the-art methods on the context-to-response generation task defined by MultiWoz2.0. The results are tabulated in FIG. 8. The CASPI adaptation of DAMD, CASPI(DAMD), is used for this task. CASPI(DAMD) performs better than the other methods on three of the four performance criteria, i.e., success rate, inform rate and combined score. HDSA has a better BLEU score. This rich expressiveness of natural language by HDSA stems from the use of a large 12-layer BERT model.
[00102] Secondly, both adaptations, CASPI(DAMD) and CASPI(MinTL), are compared on the end-to-end dialogue tasks defined by MultiWoz2.0. The results are tabulated in FIG. 9. CASPI(DAMD), with its lightweight model architecture and no pretraining on any external corpus, was able to outperform all previous methods in all evaluation criteria. This goes to show that using CASPI to shepherd the gradient update process with sample weights for each dialogue turn leads to a model that is well aligned with the true objective of the task. CASPI(MinTL), with its robust pretrained model, outperforms CASPI(DAMD) by a large margin. This goes to show the ease of adapting existing methods with CASPI.
[00103] Inverse reinforcement learning, coupled with off-policy policy learning and evaluation, is proven to be sample efficient. CASPI is competitive with other sample efficiency techniques, such as data augmentation and transfer learning as performed by Zhang et al. and Lin et al., respectively. To demonstrate the hypothesis, CASPI is tested against baselines in a low sample complexity regime. For the experimental setup, the low-resource testing strategy from Lin et al. is adopted. The CASPI model is trained on 5%, 10%, and 20% of the training data and compared with other baselines on the end-to-end dialogue and context-to-response generation tasks; FIGS. 10-11 list the results. In the end-to-end task, CASPI(MinTL) trained on only 10% of the data was able to outperform the previous state-of-the-art method, MinTL trained on 100% of the data, on two of the three performance metrics. On the context-to-response generation task, CASPI(DAMD) trained on 75% of the data was able to match the 100% data performance of HDNO. This goes to show that having the right reward function to guide the budget of the gradient update process toward the true objective is important in an extremely low resource setting.
[00104] FIG. 12 shows an example of generated responses by different ToD models, such as MinTL, CASPI(MinTL), DAMD and SimpleTOD. [00105] FIG. 13 shows the human evaluation on the criteria of appropriateness and fluency. The mean and variance of the scores are shown. The appropriateness scores of MinTL 1301, SimpleTOD 1302 and DAMD 1304 are compared against the CASPI(MinTL) appropriateness 1303. The fluency scores of MinTL 1311, SimpleTOD 1312 and DAMD 1314 are compared against the CASPI(MinTL) fluency 1313. As the results of the evaluation show, CASPI(MinTL) 1303 outperforms all other models 1301, 1302 and 1304 in the appropriateness score, while the fluency scores of CASPI(MinTL) 1313, MinTL 1311 and SimpleTOD 1312 are comparable to each other.
[00106] Automatic dialogue evaluation metrics are biased and do not truly reflect the human objective, yet in CASPI these very same dialogue evaluation metrics are used to learn the reward R(s, a, g). To bridge this gap, the following human-in-the-loop (HITL) experiment is conducted: a pair of CASPI(MinTL) models with different seeds are trained on 5% of the MultiWoz2.0 dataset. This pair of models is then used to predict on 0.5% of the MultiWoz2.0 training data (40 dialogues), and a human scored these pairs of generated responses relative to each other. The model is then trained for the reward R(s, a, g) using pairwise causal reward learning as described in relation to FIGS. 6A-6C, where examples of the mini-batch are randomly sampled either from the human-scored examples or the ones scored by the automatic evaluation metric, as shown in FIG. 13. A fresh CASPI(MinTL) model is then trained on the original 5% of the data and the learnt R(s, a, g). Human evaluation of the trained model is performed on 24 dialogues from the test set using 3 participants. FIG. 14 shows the performance. With the HITL score in the reward learning, there is a boost in performance in both human evaluation criteria: appropriateness and fluency. The 5%-data CASPI(MinTL)'s human appropriateness score is increased from 1401 to 1402, now comparable to 100%-data DAMD. The fluency score also increased from 1411 to 1412. This goes to show the versatility of pairwise causal reward learning. With enough richness of the neural network used, pairwise causal reward learning can generalize to unknown dialogue evaluation criteria.
[00107] FIG. 15 shows the same conversation between a tourist and an information center agent that is shown in FIG. 1B, with example reward values R(s_t, a_t, g) that pairwise causal reward learning has predicted against each turn. It is observed that Turn #3 has received the highest reward; retrospectively, this is the turn in which the transaction happens, which is the crucial and risk-averse turn in a dialogue and is captured by the success rate of the automatic evaluation metric. Turn #2 gets the next best reward, as it captures crucial information needed for the transaction to happen in Turn #3. Turn #4 gets a reward an order lower than Turns #3 and #2 because, other than nicety, it doesn't contribute much to the success of the conversation. It should be noted that a turn like Turn #4 will typically appear in almost all conversations, and in supervised learning it would receive the highest share of the gradient.
The learnt reward redistributes the gradient budget in a way that is aligned with the success of the dialogue objective.
[00108] FIG. 16 shows different types of behavior CASPI agents sometimes exhibit, especially when trained in a low sample regime. Greedy agent: In certain domains, the agent has a tendency to book a service before it has gathered all the required information or before the user has requested or agreed to book a service. The first example in FIG. 16 demonstrates this behavior. Here the user has requested a taxi; before enough information such as the destination or time of departure is gathered, the agent books the taxi. This happens because there are gaps in the automatic evaluation metrics. A low BLEU score and relatively high inform and success rates might indicate greedy agent behaviour. Other reasons for a low BLEU score include lack of diversity in the responses or malformation of the response.
[00109] Cautious agent: The agent tends to be cautious by providing long-winded replies packed with more information than needed. The agent tends to do this so as not to run the risk of losing reward through the inform rate. This behavior is demonstrated in the second example in FIG. 16. These subtle behaviors demonstrate gaps in the automatic evaluation metrics, which may be reduced by using human-in-the-loop evaluation as shown in FIG. 7.
[00110] Some examples of computing devices, such as computing device 400, may include non-transitory, tangible, machine-readable media that include executable code that when run by one or more processors (e.g., processor 410) may cause the one or more processors to perform the processes of methods 500 and 600. Some common forms of machine readable media that may include the processes of methods 500 and 600 are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[00111] This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
[00112] In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
[00113] Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.

Claims

WHAT IS CLAIMED IS:
1. A method for causal-aware safety policy improvement in task-oriented learning, comprising: receiving a training dataset comprising a plurality of dialogues, wherein the plurality of dialogues includes a first dialogue rollout generated according to a latent stochastic behavior policy; repeatedly sampling the training dataset for a number of times to generate a number of training subsets and a number of validation subsets; training a task-oriented dialogue model based on a cross-entropy loss using training data in a first training subset of the number of training subsets; generating, by the task-oriented dialogue model, predicted dialogue rollouts based on dialogue data in a first validation subset of the number of validation subsets; adding the predicted dialogue rollouts to a pairwise causal learning subset; sampling a pair of dialogue rollouts from the pairwise causal learning subset; and training the task-oriented dialogue model based on a binary cross-entropy loss between a preferred probability between the pair of dialogue rollouts and a normalized metric score based on the pair of dialogue rollouts.
2. The method of claim 1, further comprising: retrieving a dataset from the number of training subsets or the number of validation subsets; and training the task-oriented dialogue model by minimizing a cross -entropy of a predicted dialogue action conditioned on a current state of a dialogue according to a target policy using dialogue data from the retrieved dataset.
3. The method of claim 2, wherein the predicted dialogue rollouts are repeatedly generated according to the target policy by iterating the number of validation subsets.
4. The method of claim 1, wherein the training the task-oriented dialogue model based on a binary cross-entropy loss is performed by repeatedly sampling different pairs of dialogue rollouts from the pairwise causal learning subset and re-training the task- oriented dialogue model based on the binary cross-entropy loss until a convergence is reached in training.
5. The method of claim 1, wherein the training the task-oriented dialogue model, based on a binary cross-entropy loss further comprises: encoding, via three bi-LSTM layers, respectively, a goal, a belief state and a dialogue act or response sequence at each dialogue turn of each of the sampled pair of dialogue rollouts into three encoded representations; concatenating the three encoded representations; feeding the concatenated encoded representations to one or more feed-forward layers that generates a reward prediction for each dialogue turn; summing generated reward predictions into a dialogue reward for each one of the sampled pair of dialogue rollouts; computing the preferred probability between the pair of dialogue rollouts based on dialogue rewards corresponding to the sampled pair of dialogue rollouts; and computing the binary cross-entropy loss between the preferred probability between the pair of dialogue rollouts and the normalized metric score based on the pair of dialogue rollouts.
6. The method of claim 5, wherein the preferred probability between the pair of dialogue rollouts is computed using normalization or a softmax function.
7. The method of claim 1, further comprising: repeatedly training the task-oriented dialogue model based on a policy optimization loss that optimizes over the target policy using the training dataset until a training convergence is reached.
8. The method of claim 7, wherein the policy optimization loss is computed by: generating, by the task-oriented dialogue model, a first predicted action distribution based on a current state of a dialogue according to a target policy; computing a first discounted sum of future reward based on a discount parameter and a reward function of actions and states of the dialogue according to the latent behavior policy; computing a first loss objective based on a first expectation of the first discounted sum of future reward and the first predicted action distribution, wherein the first expectation is taken over a probability distribution of the states and the actions according to the latent stochastic behavior policy; generating, by the task-oriented dialogue model, a second predicted action distribution based on a current observation from a time series of observations according to the target policy; computing a second discounted sum of future reward based on the discount parameter and the reward function for a specific rollout, wherein the second discounted sum of future reward is a collapsed near-deterministic approximation of the first discounted sum of future reward; computing a second loss objective based on a second expectation of the second discounted sum of future reward and the second predicted action distribution, wherein the second expectation is taken over an average of the observations across the training dataset; and computing a sum of the first loss objective and the second loss objective.
9. The method of claim 8, further comprising: computing a gradient update component based on a learnt reward from the reward function of actions and states of the dialogue and a gradient of the target policy of the actions conditioned on the states and parameters of the task-oriented dialogue model; and updating the parameters of the task-oriented dialogue model using the gradient update component.
10. The method of claim 1, further comprising: randomly selecting a dialogue turn during validation of the trained task-oriented dialogue model; and receiving a set of manually created evaluation scores of a prediction on the dialogue turn from a plurality of evaluators.
11. A system for causal-aware safety policy improvement in task-oriented learning, the system comprising: a communication interface receiving a training dataset comprising a plurality of dialogues, wherein the plurality of dialogues includes a first dialogue rollout generated according to a latent stochastic behavior policy; a memory storing a plurality of processor-executable instructions; and a processor reading the plurality of processor-executable instructions from the memory to perform operations comprising: repeatedly sampling the training dataset for a number of times to generate a number of training subsets and a number of validation subsets; training a task-oriented dialogue model based on an entropy loss using training data in a first training subset of the number of training subsets; generating, by the task-oriented dialogue model, predicted dialogue rollouts from dialogue data in a first validation subset of the number of validation subsets; adding the predicted dialogue rollouts to a pairwise causal learning subset; sampling a pair of dialogue rollouts from the pairwise causal learning subset; and training the task-oriented dialogue model based on a binary cross-entropy loss between a preferred probability between the pair of dialogue rollouts and a normalized metric score based on the pair of dialogue rollouts.
12. The system of claim 11, wherein the operations further comprise: retrieving a dataset from the number of training subsets or the number of validation subsets; and training the task-oriented dialogue model by minimizing an entropy of a predicted dialogue action conditioned on a current state of a dialogue according to a target policy using dialogue data from the retrieved dataset.
13. The system of claim 12, wherein the predicted dialogue rollouts are repeatedly generated according to the target policy by iterating the number of validation subsets.
14. The system of claim 11, wherein the operation of training the task-oriented dialogue model based on a binary cross-entropy loss is performed by repeatedly sampling different pairs of dialogue rollouts from the pairwise causal learning subset and re-training the task-oriented dialogue model based on the binary cross-entropy loss until a convergence is reached in training.
15. The system of claim 11, wherein the operation of training the task-oriented dialogue model based on a binary cross-entropy loss further comprises: encoding, via three bi-LSTM layers, respectively, a goal, a belief state and a dialogue act or response sequence at each dialogue turn of each of the sampled pair of dialogue rollouts into three encoded representations; concatenating the three encoded representations; feeding the concatenated encoded representations to one or more feed-forward layers that generates a reward prediction for each dialogue turn; summing generated reward predictions into a dialogue reward for each one of the sampled pair of dialogue rollouts; computing the preferred probability between the pair of dialogue rollouts based on dialogue rewards corresponding to the sampled pair of dialogue rollouts; and computing the binary cross-entropy loss between the preferred probability between the pair of dialogue rollouts and the normalized metric score based on the pair of dialogue rollouts.
16. The system of claim 15, wherein the preferred probability between the pair of dialogue rollouts is computed using normalization or a softmax function.
17. The system of claim 11, wherein the operations further comprise:
    generating, by the task-oriented dialogue model, a first predicted action distribution based on a current state of a dialogue according to a target policy;
    computing a first discounted sum of future reward based on a discount parameter and a reward function of actions and states of the dialogue according to the latent behavior policy;
    computing a first loss objective based on a first expectation of the first discounted sum of future reward and the first predicted action distribution, wherein the first expectation is taken over a probability distribution of the states and the actions according to the latent stochastic behavior policy;
    generating, by the task-oriented dialogue model, a second predicted action distribution based on a current observation from a time series of observations according to the target policy;
    computing a second discounted sum of future reward based on the discount parameter and the reward function for a specific rollout, wherein the second discounted sum of future reward is a collapsed near-deterministic approximation of the first discounted sum of future reward;
    computing a second loss objective based on a second expectation of the second discounted sum of future reward and the second predicted action distribution, wherein the second expectation is taken over an average of the observations across the training dataset; and
    computing a sum of the first loss objective and the second loss objective.
18. The system of claim 17, wherein the operations further comprise: computing a gradient update component based on a learnt reward from the reward function of actions and states of the dialogue and a gradient of the target policy of the actions conditioned on the states and parameters of the task-oriented dialogue model; and updating the parameters of the task-oriented dialogue model using the gradient update component.
19. The system of claim 11, wherein the operations further comprise: randomly selecting a dialogue turn during validation of the trained task-oriented dialogue model; and receiving a set of manually created evaluation scores of a prediction on the dialogue turn from a plurality of evaluators.
20. A non-transitory processor-readable storage medium storing a plurality of processor-executable instructions for causal-aware safety policy improvement in task-oriented learning, the instructions being executed by a processor to perform operations comprising:
    receiving a training dataset comprising a plurality of dialogues, wherein the plurality of dialogues includes a first dialogue rollout generated according to a latent stochastic behavior policy;
    repeatedly sampling the training dataset for a number of times to generate a number of training subsets and a number of validation subsets;
    training a task-oriented dialogue model based on an entropy loss using training data in a first training subset of the number of training subsets;
    generating, by the task-oriented dialogue model, predicted dialogue rollouts from dialogue data in a first validation subset of the number of validation subsets;
    adding the predicted dialogue rollouts to a pairwise causal learning subset;
    sampling a pair of dialogue rollouts from the pairwise causal learning subset; and
    training the task-oriented dialogue model based on a binary cross-entropy loss between a preferred probability between the pair of dialogue rollouts and a normalized metric score based on the pair of dialogue rollouts.
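The training procedure recited in claims 11, 13, 14 and 20 (repeated resampling of the training data, entropy-loss training, generation of predicted rollouts on held-out splits, and pairwise causal learning with a binary cross-entropy objective) can be pictured with the short, non-authoritative sketch below. The helpers `split_dataset`, `generate_rollout`, `metric_score` and the two model objects are hypothetical placeholders standing in for the claimed components, and the shapes and optimizer choices are assumptions made for illustration only.

```python
import random
import torch
import torch.nn.functional as F

# Illustrative sketch only.  split_dataset, generate_rollout, metric_score,
# and the policy / reward_model objects are hypothetical placeholders.

def train(policy, reward_model, dataset, num_splits=5, pair_steps=1000):
    policy_opt = torch.optim.Adam(policy.parameters())
    reward_opt = torch.optim.Adam(reward_model.parameters())
    pairwise_subset = []

    # Repeatedly resample the training data into train / validation subsets.
    for _ in range(num_splits):
        train_split, val_split = split_dataset(dataset)

        # Entropy (negative log-likelihood) loss on the training subset:
        # the policy imitates the logged dialogue action at each turn state.
        for state, action in train_split:
            policy_opt.zero_grad()
            loss = F.cross_entropy(policy(state), action)  # logits vs. action index
            loss.backward()
            policy_opt.step()

        # Rollouts predicted on the held-out subset are pooled into the
        # pairwise causal learning subset.
        for dialogue in val_split:
            pairwise_subset.append(generate_rollout(policy, dialogue))

    # Sample pairs and fit the binary cross-entropy preference objective
    # against a normalized task-metric score, repeating until convergence.
    for _ in range(pair_steps):
        a, b = random.sample(pairwise_subset, 2)
        reward_opt.zero_grad()
        pref = reward_model.preference(a, b)               # P(a preferred to b)
        score = metric_score(a) / (metric_score(a) + metric_score(b))
        loss = F.binary_cross_entropy(pref, torch.tensor(score))
        loss.backward()
        reward_opt.step()
    return policy, reward_model
```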
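One plausible way to write the entropy objective of claim 12 is as the negative log-likelihood of the logged dialogue action under the target policy, averaged over the retrieved subset. The symbols below (θ for the model parameters, π_θ for the target policy, D for the retrieved dataset, s_t and a_t for the dialogue state and action at turn t) are editorial notation, not terms fixed by the claims.

```latex
\mathcal{L}_{\text{ent}}(\theta)
  = -\,\mathbb{E}_{(s_t,\, a_t) \sim \mathcal{D}}
      \big[\, \log \pi_{\theta}(a_t \mid s_t) \,\big]
```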
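Claims 15 and 16 describe a turn-level reward model built from three bi-LSTM encoders and a feed-forward head. A minimal PyTorch sketch follows; the hidden size, the two-layer head, and the softmax over the two dialogue rewards are illustrative assumptions rather than details fixed by the claims. In the training sketch above, `reward_model.preference(a, b)` would correspond to this `preference` method, whose output enters the binary cross-entropy against the normalized metric score.

```python
import torch
import torch.nn as nn

class PairwiseRewardModel(nn.Module):
    """Per-turn reward from goal, belief-state, and act/response encodings."""

    def __init__(self, in_dim, hidden=128):
        super().__init__()
        # Three bi-LSTM encoders: goal, belief state, dialogue act / response.
        self.goal_enc = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.belief_enc = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        self.act_enc = nn.LSTM(in_dim, hidden, bidirectional=True, batch_first=True)
        # Feed-forward head mapping the concatenation to a scalar turn reward.
        self.head = nn.Sequential(nn.Linear(6 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def turn_reward(self, goal, belief, act):
        # Each input: (turns, seq_len, in_dim); keep the last-step bi-LSTM output.
        g = self.goal_enc(goal)[0][:, -1, :]
        b = self.belief_enc(belief)[0][:, -1, :]
        a = self.act_enc(act)[0][:, -1, :]
        # Concatenate the three encoded representations, predict one reward per turn.
        return self.head(torch.cat([g, b, a], dim=-1)).squeeze(-1)   # (turns,)

    def dialogue_reward(self, rollout):
        # Sum the per-turn reward predictions over the whole rollout.
        goal, belief, act = rollout
        return self.turn_reward(goal, belief, act).sum()

    def preference(self, rollout_a, rollout_b):
        # Softmax over the two dialogue rewards gives P(a preferred to b).
        rewards = torch.stack([self.dialogue_reward(rollout_a),
                               self.dialogue_reward(rollout_b)])
        return torch.softmax(rewards, dim=0)[0]
```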
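Claims 17 and 18 can be read as a pair of policy-gradient-style objectives over the logged data. The formulation below is an editorial reconstruction in conventional reinforcement-learning notation (γ for the discount parameter, R for the reward function, β for the latent stochastic behavior policy, π_θ for the target policy, o_t for observations, R̂ for the learnt reward, α for a learning rate); the claims do not fix this exact notation, and the log-likelihood-weighted form of the objectives is one common reading rather than the only one.

```latex
\begin{align*}
G_t &= \sum_{k \ge 0} \gamma^{k}\, R(s_{t+k}, a_{t+k})
      && \text{discounted sum of future reward} \\[4pt]
J_1(\theta) &= \mathbb{E}_{(s_t,\, a_t) \sim \beta}
      \big[\, G_t \, \log \pi_{\theta}(a_t \mid s_t) \,\big]
      && \text{expectation under the latent stochastic behavior policy } \beta \\[4pt]
J_2(\theta) &= \mathbb{E}_{o_t \sim \mathcal{D}}
      \big[\, \hat{G}_t \, \log \pi_{\theta}(a_t \mid o_t) \,\big]
      && \text{collapsed, near-deterministic per-rollout approximation} \\[4pt]
J(\theta) &= J_1(\theta) + J_2(\theta)
      && \text{sum of the two loss objectives} \\[4pt]
\theta &\leftarrow \theta + \alpha \,\hat{R}(s_t, a_t)\,
      \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)
      && \text{gradient update with the learnt reward } \hat{R}
\end{align*}
```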
PCT/US2022/014034 2021-02-12 2022-01-27 Systems and methods for safe policy improvement for task oriented dialogues WO2022173593A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023548751A JP2024507162A (en) 2021-02-12 2022-01-27 System and method for improving safety measures for task-oriented dialogue
CN202280019333.1A CN117136360A (en) 2021-02-12 2022-01-27 System and method for task-oriented dialog security policy improvement

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202163148861P 2021-02-12 2021-02-12
US63/148,861 2021-02-12
US17/500,855 US20220036884A1 (en) 2020-06-04 2021-10-13 Systems and methods for safe policy improvement for task oriented dialogues
US17/500,855 2021-10-13

Publications (1)

Publication Number Publication Date
WO2022173593A1 true WO2022173593A1 (en) 2022-08-18

Family

ID=80446901

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/014034 WO2022173593A1 (en) 2021-02-12 2022-01-27 Systems and methods for safe policy improvement for task oriented dialogues

Country Status (2)

Country Link
JP (1) JP2024507162A (en)
WO (1) WO2022173593A1 (en)

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAN-GUO ZHANG ET AL: "Find or Classify? Dual Strategy for Slot-Value Predictions on Multi-Domain Dialog State Tracking", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 8 October 2019 (2019-10-08), XP081512326 *
PAUL ET AL.: "Feature selection as causal inference: Experiments with text classification", PROCEEDINGS OF THE 21ST CONFERENCE ON COMPUTATIONAL NATURAL LANGUAGE LEARNING, 2017, pages 163 - 172
TSUNG-HSIEN WEN ET AL: "A Network-based End-to-End Trainable Task-oriented Dialogue System", CORR (ARXIV), vol. 1604.04562v1, 15 April 2016 (2016-04-15), Stroudsburg, PA, USA, pages 1 - 11, XP055396370 *

Also Published As

Publication number Publication date
JP2024507162A (en) 2024-02-16

Similar Documents

Publication Publication Date Title
Perez et al. Dialog state tracking, a machine reading approach using memory network
Zhong et al. A differential evolution algorithm with dual populations for solving periodic railway timetable scheduling problem
Kwon et al. A hybrid neurogenetic approach for stock forecasting
US11615305B2 (en) System and method for machine learning architecture with variational hyper-RNN
US20220036884A1 (en) Systems and methods for safe policy improvement for task oriented dialogues
Vlasov et al. Few-shot generalization across dialogue tasks
Rajchakit et al. Stability analysis of neural networks
US20210174798A1 (en) Learning dialogue state tracking with limited labeled data
US11922305B2 (en) Systems and methods for safe policy improvement for task oriented dialogues
Ramachandran et al. Causal-aware safe policy improvement for task-oriented dialogue
US11205111B2 (en) End of period metric projection with intra-period alerts
Kamuni et al. Enhancing End-to-End Multi-Task Dialogue Systems: A Study on Intrinsic Motivation Reinforcement Learning Algorithms for Improved Training and Adaptability
CN114118570A (en) Service data prediction method and device, electronic equipment and storage medium
CN111461862B (en) Method and device for determining target characteristics for service data
CN117136360A (en) System and method for task-oriented dialog security policy improvement
Ilievski Building advanced dialogue managers for goal-oriented dialogue systems
Kwon et al. Daily stock prediction using neuro-genetic hybrids
WO2022173593A1 (en) Systems and methods for safe policy improvement for task oriented dialogues
Mizutani et al. Totally model-free actor-critic recurrent neural-network reinforcement learning in non-Markovian domains
Binu et al. A Cloud-Based Data Analysis and Prediction System for University Admission
Kwon et al. Evolutionary ensemble for stock prediction
WO2021229626A1 (en) Learning device, learning method, and learning program
Mo et al. Cross-domain dialogue policy transfer via simultaneous speech-act and slot alignment
Das et al. A harmony search-based artificial neural network for stock market prediction
Ma et al. Cascaded LSTMs based deep reinforcement learning for goal-driven dialogue

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22704195

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023548751

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22704195

Country of ref document: EP

Kind code of ref document: A1