CN111400468B - Conversation state tracking system and method, man-machine conversation device and method - Google Patents

Conversation state tracking system and method, man-machine conversation device and method

Info

Publication number
CN111400468B
CN111400468B CN202010165025.2A
Authority
CN
China
Prior art keywords
state
level
coarse
grained
dialog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010165025.2A
Other languages
Chinese (zh)
Other versions
CN111400468A (en)
Inventor
俞凯
陈志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202010165025.2A priority Critical patent/CN111400468B/en
Publication of CN111400468A publication Critical patent/CN111400468A/en
Application granted granted Critical
Publication of CN111400468B publication Critical patent/CN111400468B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a dialogue state tracking system and method and a man-machine dialogue device and method, wherein the dialogue state tracking system comprises: a hierarchical statement encoder configured to hierarchically encode the dialogue history statements; a coarse-grained decoder configured to decode the hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialogue state; a coarse-grained encoder configured to perform coarse-grained encoding on the coarse-grained dialogue state; and a state decoder configured to determine a serialized dialogue state based on the hierarchical encoding result and the coarse-grained encoding result. The invention represents dialogue states as a structured sequence for the first time and proposes a dialogue state tracking model based on coarse-to-fine sequence generation. Since the proposed model performs generative dialogue state tracking, slot values do not need to be known in advance, and the sequence form of the dialogue state ensures that slot values are mutually visible when the dialogue state is generated.

Description

Conversation state tracking system and method, man-machine conversation device and method
Technical Field
The invention relates to the technical field of man-machine conversation, in particular to a conversation state tracking system and method and a man-machine conversation device and method.
Background
The state tracking method of the prior art includes the following two types:
(1) Dialogue state classification models: classification models for dialogue state tracking predict each slot value independently, so the dialogue history statements must be re-encoded for every slot. In addition, classification methods need to know all possible values of each predicted slot in advance, which is impractical in real application scenarios: a slot such as 'location' can take a large number of values, and the values of some slots are not even enumerable, which classification methods cannot handle.
(2) Slot-value-independent dialogue state generation models: generative dialogue state tracking models aim to remove the requirement of the above classification methods that slot values be given in advance. A generative model finds the specific value of each slot directly in the dialogue history, but earlier generative models still predict each slot independently. The problem with such generative dialogue state tracking systems is that the computational complexity of generating the dialogue state is high.
The above two methods have at least the following disadvantages:
Dialogue state classification models: they depend on predefined slot values and have high computational complexity. Slot-value-independent dialogue state generation models: they have high computational complexity, and the prediction processes of different slot values are mutually independent, with no relation established between them.
Disclosure of Invention
An embodiment of the present invention provides a system and a method for tracking a dialog state, which are used to solve at least one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a dialog state tracking system, including:
a hierarchical statement encoder configured to hierarchically encode the conversational history statement;
a coarse-grained decoder configured to decode the hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialogue state;
a coarse-grained encoder configured to coarse-grained encode the coarse-grained dialog state;
a state decoder configured to determine a serialized dialog state based on the hierarchical encoding result and the coarse-grained encoding result.
In a second aspect, an embodiment of the present invention provides a human-machine conversation device, including the conversation state tracking system described in the foregoing embodiments.
In a third aspect, an embodiment of the present invention provides a dialog state tracking method, which is applied to the dialog state tracking system in the foregoing embodiment, where the method includes:
the hierarchical statement encoder hierarchically encodes the dialogue history statements;
the coarse-grained decoder decodes the hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialogue state;
the coarse-grained encoder performs coarse-grained encoding on the coarse-grained dialogue state;
and the state decoder determines a serialized dialogue state according to the hierarchical encoding result and the coarse-grained encoding result.
In a fourth aspect, an embodiment of the present invention provides a human-machine conversation method, including the steps of the conversation state tracking method described in the foregoing embodiment.
In a fifth aspect, an embodiment of the present invention provides a storage medium storing one or more programs that include execution instructions, the execution instructions being readable and executable by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above dialogue state tracking methods and/or man-machine dialogue methods of the present invention.
In a sixth aspect, an electronic device is provided, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform any one of the above dialogue state tracking methods and/or man-machine dialogue methods of the present invention.
In a seventh aspect, an embodiment of the present invention further provides a computer program product, the computer program product including a computer program stored on a storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to execute any one of the above dialogue state tracking methods and/or man-machine dialogue methods.
The embodiments of the invention have the following beneficial effects: the invention represents dialogue states as a structured sequence for the first time and proposes a dialogue state tracking model based on coarse-to-fine sequence generation. Since the proposed model performs generative dialogue state tracking, slot values do not need to be known in advance, and the sequence form of the dialogue state ensures that slot values are mutually visible when the dialogue state is generated.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram of the structured state representation and the dialogue structure according to the present invention;
FIG. 2 is a functional block diagram of one embodiment of a dialog state tracking system of the present invention;
FIG. 3 is a schematic diagram of another embodiment of a dialog state tracking system according to the present invention;
FIG. 4 is a flowchart of a dialog state tracking method according to an embodiment of the present invention;
FIG. 5a is a graph of the impact of dialogue length on the two joint accuracies in the present invention;
FIG. 5b is a graph of the impact of ground-truth state length on the two joint accuracies;
Fig. 6 is a schematic structural diagram of an embodiment of an electronic device of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention. It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict.
As used in this disclosure, "module," "device," "system," and the like are intended to refer to a computer-related entity, either hardware, a combination of hardware and software, or software in execution. In particular, for example, an element may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. Also, an application or script running on a server, or a server, may be an element. One or more elements may be in a process and/or thread of execution and an element may be localized on one computer and/or distributed between two or more computers and can be operated by various computer-readable media. The elements may also communicate by way of local and/or remote processes in accordance with a signal having one or more data packets, e.g., signals from data interacting with another element in a local system, distributed system, and/or across a network of the internet with other systems by way of the signal.
Finally, it is also noted that, in the present invention, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n)…" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Fig. 1 shows the structured state representation and the dialogue structure; the data in this example come from the MultiWOZ dataset, and note that the source data are in English. In the present invention, we represent the dialogue state as the structured dialogue state shown in FIG. 1, which includes the dialogue domains, the slots and the corresponding slot values. A dialogue state in sequence form ensures that different slot values are visible to each other, so the model can capture relations between different slot values. Compared with classification methods, which must re-encode the dialogue history for every slot, this representation generates the whole dialogue state at once, which increases the scalability of the model. To simplify the state generation process of the dialogue state tracking model, the proposed model first predicts a coarse-grained dialogue state that contains only dialogue domains and slots, without the corresponding specific slot values, and then generates the fine-grained dialogue state based on the predicted coarse-grained state; this simplifies the process of generating the dialogue state. An illustration of the two serialized states is sketched below.
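For illustration only, the two levels of state described above could be serialized as in the following minimal sketch (the delimiter tokens and the MultiWOZ-style values are illustrative assumptions, not fixed by the invention):

```python
# Hypothetical serialization of the structured dialogue state of FIG. 1.
# The bracket/semicolon delimiters and the "<EOB>" end token are assumptions.

# Coarse-grained state: dialogue domains and slots only, without values.
coarse_state = "hotel ( pricerange ; area ) train ( departure ) <EOB>"

# Fine-grained (final) state: the same structure with the slot values filled in.
fine_state = ("hotel ( pricerange = cheap ; area = centre ) "
              "train ( departure = cambridge ) <EOB>")
```

Because both levels are flat token sequences, a standard encoder-decoder model can emit the whole state in a single decoding pass.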
As shown in FIG. 2, an embodiment of the present invention provides a dialog state tracking system 100, comprising:
a hierarchical sentence encoder 110 configured to hierarchically encode the conversational history sentence;
a coarse-grained decoder 120 configured to decode according to the hierarchical encoding result of the hierarchical statement encoder 110 to obtain a coarse-grained dialog state;
a coarse-grained encoder 130 configured to coarse-grained encode the coarse-grained dialog state;
a state decoder 140 configured to determine a serialized dialog state based on the layered coding result and the coarse-grained coding result.
In this embodiment, the hierarchical statement encoder encodes the dialogue history statements, the coarse-grained decoder decodes the result of the hierarchical encoder to obtain a coarse-grained dialogue state containing only dialogue domains and slots, and the coarse-grained encoder then encodes the obtained coarse-grained dialogue state. The resulting encoding, together with the output of the dialogue history encoder, provides the information for generating the final dialogue state, whose decoding is performed by the fine-grained state decoder.
The present invention represents dialogue states as structured sequences for the first time and presents a dialogue state tracking model based on coarse-to-fine sequence generation. The proposed model performs generative dialogue state tracking, so slot values do not need to be known in advance, and the sequence form of the dialogue state ensures that slot values are mutually visible when the dialogue state is generated.
Fig. 3 is a schematic structural diagram of an embodiment of the dialog state tracking system of the present invention, in which the hierarchical statement encoder 110 includes a word-level encoder 111 and a statement-level encoder 112 connected in sequence, in which,
the word-level encoder 111 is configured to perform word-level encoding on the sentences in the dialogue history;
the sentence-level encoder 112 is configured to perform sentence-level encoding on the word-level encoding result output by the word-level encoder 111, and output the sentence-level encoding result to the coarse-granularity decoder 120 and the state decoder 140.
In some embodiments, the word-level encoder is configured with a first memory serving the word-level decoder-to-encoder attention, the first memory being connected to the state decoder;
the sentence-level encoder is configured with a second memory serving the sentence-level decoder-to-encoder attention, the second memory being coupled to the coarse-grained decoder and the state decoder. Illustratively, the word-level encoder, the sentence-level encoder and the coarse-grained encoder comprise bidirectional GRU units, and the coarse-grained decoder and the state decoder comprise unidirectional GRU units.
In this embodiment, the decoder-to-encoder attention of the coarse-grained decoder only needs the sentence vectors. The coarse-grained encoder encodes the predicted state architecture into a series of architecture vectors. Unlike the coarse-grained decoder, the state decoder can copy specific slot values directly from the dialogue statements. There are two levels of decoder-to-encoder attention in the state decoder: sentence-level attention over the sentence vectors (stored in the sentence memory) and word-level attention over the word vectors (stored in the word memory). When the previous word appears in the state architecture, the corresponding architecture vector is provided to the state decoder in addition to the embedding of the previous word.
In the embodiment of the invention, the sentence-level decoder-to-encoder attention captures the dialogue domains and corresponding slots mentioned in the dialogue, while the word-level decoder-to-encoder attention better extracts the specific values of the relevant slots in the dialogue. The two attention mechanisms form a hierarchical attention mechanism that lets the model understand the dialogue at different levels, so the dialogue state can be determined more accurately.
Fig. 4 is a flowchart of an embodiment of a dialog state tracking method according to the present invention, which is applied to the dialog state tracking system described in the foregoing embodiment. As shown in fig. 4, the dialog state tracking method includes:
s10, the hierarchical statement coder carries out hierarchical coding on the conversation history statement;
s20, the coarse-grained decoder decodes according to the layering coding result of the layering statement coder to obtain a coarse-grained dialog state;
s30, the coarse-grained coder carries out coarse-grained coding on the coarse-grained dialog state;
and S40, the state decoder determines the serialized conversation state according to the layering coding result and the coarse-grained coding result.
In some embodiments, the hierarchical sentence encoder comprises a word-level encoder and a sentence-level encoder connected in sequence, wherein,
the hierarchical statement encoder hierarchically encoding the dialogue history statements comprises:
the word-level encoder performing word-level encoding on the statements in the dialogue history;
and the sentence-level encoder performing sentence-level encoding on the word-level encoding results output by the word-level encoder, and outputting the sentence-level encoding results to the coarse-grained decoder and the state decoder.
In some embodiments, the word-level encoder is configured with a first memory serving the word-level decoder-to-encoder attention, the first memory being connected to the state decoder; the sentence-level encoder is configured with a second memory serving the sentence-level decoder-to-encoder attention, the second memory being connected to the coarse-grained decoder and the state decoder;
the coarse-grained decoder decoding the hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialogue state comprises:
the coarse-grained decoder decoding according to the hidden state of the sentence-level encoder and the sentence-level decoder-to-encoder attention to obtain the coarse-grained dialogue state;
the state decoder determining a serialized dialogue state according to the hierarchical encoding result and the coarse-grained encoding result comprises:
the state decoder determining the serialized dialogue state according to the word-level decoder-to-encoder attention, the sentence-level decoder-to-encoder attention, the hidden state of the sentence-level encoder, and the coarse-grained encoding result.
In some embodiments, the word-level encoder, the sentence-level encoder and the coarse-grained encoder comprise bidirectional GRU units, and the coarse-grained decoder and the state decoder comprise unidirectional GRU units.
The dialogue state tracking method of the embodiment of the present invention may be used in the dialogue state tracking system of the embodiment of the present invention, and accordingly achieves the technical effects achieved by the above system embodiment, which are not repeated here. In the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.
In some embodiments, an embodiment of the present invention provides a human-machine conversation device, including the conversation state tracking system described in any of the foregoing embodiments.
In some embodiments, an embodiment of the present invention provides a human-machine conversation method, including the steps of the conversation state tracking method described in any of the foregoing embodiments.
In an end-to-end dialogue system, the goal of the dialogue state tracker is to accurately find a compact representation of the current dialogue state from the entire dialogue context. However, existing dialogue state tracking methods treat the dialogue state as a combination of disjoint triples (domain-slot-value). The present invention redefines dialogue state tracking as a sequence generation problem with a structured state representation. On this basis, a coarse-to-fine sequence generation method for dialogue states (the CREDIT model) is proposed, which uses the structured sequence state representation and optimizes a natural-language metric with a policy-gradient method, so that the pre-trained model can be further fine-tuned. Like all generative state tracking methods, CREDIT does not rely on a predefined dialogue ontology enumerating all possible slot values. Empirical results show that our tracker achieves encouraging joint goal accuracy on the five domains of the MultiWOZ 2.0 and MultiWOZ 2.1 datasets. The experiments are described in detail below:
1. Introduction
Dialogue state tracking is a core component of task-oriented dialogue systems, used to estimate a compact representation of the dialogue context between the dialogue agent and the user. The dialogue agent decides how to respond to the user based on this compact representation, which is also referred to as the belief state. The belief state typically contains the slot values confirmed by the user in the different dialogue domains, e.g., hotel (price = cheap; location = city centre). In conventional dialogue systems, the upstream component of the dialogue state tracking model is Natural Language Understanding (NLU): the dialogue state is updated with the semantic representation of the user response (parsed by the NLU) in the current dialogue turn. Nowadays, end-to-end state tracking methods, which merge the NLU into traditional DST and predict the dialogue state directly from the dialogue statements, are receiving more and more attention from the research field.
Recently proposed end-to-end state tracking methods can be divided into two categories: classification methods and generation methods. Classification methods typically require that all possible slot values be given by an ontology; however, in practical dialogue scenarios, some slot values cannot be enumerated. To alleviate this problem, generation methods have been proposed, in which slot values are generated directly from the dialogue history. Like the classification methods, however, most generation methods generate slot values one by one until all slots of the different domains have been visited, so the computational cost is proportional to the number of slots. This problem is exacerbated as the dialogue task becomes complex.
In the present invention, we formalize the dialogue state tracking task as a specific instance of dialogue-level semantic parsing to solve the above problems. As shown in FIG. 1, we redefine the belief state as a structured state representation in which a domain delimiter groups together all slot-value pairs of the corresponding domain. CREDIT is an encoder-decoder model that maps dialogue statements into structured state representations. CREDIT first generates a coarse state representation containing only the domain-slot pairs; the final state is then inferred from the coarse state and the statements. Since the structured state representation is a sequence, linguistic metrics (e.g., the BLEU score) can be used to measure the estimation quality of CREDIT. With the cross-entropy loss of supervised learning, such language metrics of the structured state cannot be optimized directly; in the present invention, we therefore further use a reinforcement learning method to fine-tune the pre-trained tracking model.
Single-step dialogue state generation and the performance boost are the main advantages of CREDIT. The contribution of the invention has three parts:
1) In the present invention, we use a coarse-to-fine decoding method to generate the structured state. First, the coarse state, composed of the domain-slot pairs, is decoded. The fine state is then generated from the coarse state and the dialogue statements using a copy mechanism.
2) Our proposed CREDIT is a single-step state generation method that does not need to traverse all possible slots to generate the corresponding dialogue state. CREDIT directly maps dialogue statements to structured states with an Inference Time Complexity (ITC) of O(1).
3) To improve the performance of CREDIT on the language metric, the metric is treated as a reward, and the parameters of the pre-trained tracking model are further fine-tuned under a reinforcement learning loss.
2. Related work
Multi-domain DST: with the release of MultiWOZ, one of the largest task-oriented dialogue datasets, many Dialogue State Tracking (DST) methods for multi-domain tasks have been proposed. FJST and HJST are two simple methods: they encode the dialogue history directly into a vector and predict all slot values from it. Instead of concatenating the entire dialogue history directly as input as in FJST, HJST adopts a hierarchical model as the encoder; HyST is a hybrid approach that improves HJST by adding a value copy mechanism. With the development of pre-trained models, SUMBT first encodes dialogue statements and slots into vectors using pre-trained BERT and computes candidate values for the corresponding slots. TRADE is a method that generates slot values directly from the dialogue history. Following TRADE, there are several improved versions: DS-DST, DST-picklist and SOM-DST. DS-DST and DST-picklist divide the slots into uncountable and countable types and generate the slot values in a hybrid fashion (like HyST). In contrast to DS-DST, DST-picklist knows all the candidate values of the slots, including the uncountable ones. Unlike TRADE, SOM-DST takes the dialogue history and the previous state as inputs and modifies the previous state into the current state using the dialogue history. These improved versions rely on pre-trained BERT as the encoder.
Differently from these methods, the DST Reader formulates the DST task as a machine reading task and solves the multi-domain task with corresponding methods. In the present invention, we formulate the DST task as a dialogue-level semantic parsing task and solve it with a coarse-to-fine decoding method. COMER is the closest method to our proposed CREDIT: it also directly generates all domain-slot-value tuples. However, the input of COMER includes the previously predicted state and the utterances of the current dialogue turn, and its generated state is a tree rather than a structured state sequence.
Our proposed CREDIT (Coarse-to-fine DIalogue state Tracking) improves the coarse-to-fine decoding method with a two-stage attention mechanism. More recently, the dialogue summary task has been solved with a similar hierarchical generation approach, which first generates predefined dialogue summary actions and then the final summary.
3. CREDIT model
Fig. 3 is a schematic structural diagram of an embodiment of the dialogue state tracking system of the present invention, which includes a hierarchical statement encoder, a coarse-grained decoder, a coarse-grained encoder and a state decoder, where a word memory stores the word vectors encoded by the word-level encoder and a sentence memory collects the sentence vectors encoded by the sentence-level encoder. We do not need to predict all slot values; instead we generate a dialogue state in which slots with value "none" are not contained, as shown in FIG. 1. Similar to a semantic parsing task, we redefine the dialogue state as a structured representation. However, unlike traditional semantic parsing tasks, which map a single sentence to a structured representation, the dialogue state tracking task requires reasoning over all sentences of the entire dialogue to obtain the complete dialogue state.
The hierarchical sentence encoder comprises two different encoders: a word-level encoder and a sentence-level encoder. The word-level encoder encodes each word of a sentence into a word vector; the sentence-level encoder encodes each sentence of the dialogue into a sentence vector from the word vectors using an attention mechanism.
The coarse-grained decoder predicts the state architecture of a dialogue, which is composed of the domain-slot pairs mentioned in the dialogue, as shown in FIG. 1. The decoder-to-encoder attention of the coarse-grained decoder only needs the sentence vectors, since no values need to be copied from the statements.
The coarse-grained encoder encodes the predicted state architecture as a series of architecture vectors. Unlike the coarse-grained decoder, the state decoder can copy specific slot values directly from the dialogue statements. There are two levels of decoder-to-encoder attention in the state decoder: sentence-level attention over the sentence vectors (stored in the sentence memory) and word-level attention over the word vectors (stored in the word memory). When the previous word appears in the state architecture, the corresponding architecture vector is provided to the state decoder in addition to the embedding of the previous word.
We define $x = \{x_1, x_2, \ldots, x_{|x|}\}$ as the set of dialogue statements and $y = \{y_1, y_2, \ldots, y_{|y|}\}$ as its structured state representation. Our goal is to learn a state tracker from dialogue statements $x$ paired with structured state representations $y$. We estimate $p(y|x)$, i.e., the conditional probability of state $y$ given $x$. In the coarse-to-fine formulation, we decompose $p(y|x)$ into a two-stage generation process:

$$p(y|x) = p(y|x,a)\,p(a|x) \qquad (1)$$

where $a = \{a_1, a_2, \ldots, a_{|a|}\}$ is the state architecture, an abstraction of $y$.
3.1 Hierarchical statement encoder
A hierarchical statement encoder encodes the dialogue into a sequence of word vectors and contextual statement vectors.
Word-level encoder: we use a bidirectional Gated Recurrent Unit (GRU) to encode each statement. The input of the word-level encoder is one dialogue statement, denoted as $x_i = \{x_i^1, x_i^2, \ldots, x_i^{|x_i|}\} \in \mathbb{R}^{|x_i| \times d_{emb}}$, where $x_i^j$ is the $j$-th word embedding of the $i$-th dialogue statement and $d_{emb}$ is the embedding size. The encoded statement is given by Equations 2-4:

$$\overrightarrow{h}_i^{\,j} = \overrightarrow{f}_x^{GRU}\big(\overrightarrow{h}_i^{\,j-1}, x_i^j\big) \qquad (2)$$
$$\overleftarrow{h}_i^{\,j} = \overleftarrow{f}_x^{GRU}\big(\overleftarrow{h}_i^{\,j+1}, x_i^j\big) \qquad (3)$$
$$h_i^j = \big[\overrightarrow{h}_i^{\,j}, \overleftarrow{h}_i^{\,j}\big] \in \mathbb{R}^{2 d_{hdd}} \qquad (4)$$

where $d_{hdd}$ is the size of the hidden layer, $[\cdot,\cdot]$ denotes the concatenation of two vectors, and $f_x^{GRU}$ is the word-level bidirectional GRU function.
Statement-level encoder: we use an attention mechanism to make the encoder module more aware of the words associated with the states. For each word $x_i^j$ in statement $x_i$, the attention score is computed as

$$\mathrm{score}(x_i^j) = v_s^{\top} \tanh\big(W_s h_i^j\big) \qquad (5)$$

where $v_s$ and $W_s$ are trainable parameters; the attention summary is applied over $h_i$ to obtain a statement vector:

$$u_i' = \mathrm{softmax}\big(\mathrm{score}(x_i)\big)\, h_i \qquad (6)$$

Here we obtain the statement-level representation $u' = \{u_1', u_2', \ldots, u_{|x|}'\}$. Since the token vectors in $h_i$ do not contain any cross-statement context information, the statement vectors $u_i'$ are context-free. We feed the context-free statement vectors $u_i'$ to another bidirectional GRU to obtain the contextual statement-level representation $u = \{u_1, u_2, \ldots, u_{|x|}\}$:

$$u = f_u^{GRU}(u') \qquad (7)$$

where $f_u^{GRU}$ is the statement-level bidirectional GRU function.
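The following PyTorch sketch shows how Equations 2-7 fit together (an illustrative re-implementation under our own naming and shape conventions, not the inventors' released code):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Sketch of the hierarchical statement encoder (Equations 2-7)."""
    def __init__(self, d_emb: int, d_hdd: int):
        super().__init__()
        # Word-level bidirectional GRU f_x (Equations 2-4).
        self.word_gru = nn.GRU(d_emb, d_hdd, bidirectional=True, batch_first=True)
        # Attention scorer of Equation 5: v^T tanh(W h).
        self.att_w = nn.Linear(2 * d_hdd, 2 * d_hdd)
        self.att_v = nn.Linear(2 * d_hdd, 1, bias=False)
        # Statement-level bidirectional GRU f_u (Equation 7).
        self.sent_gru = nn.GRU(2 * d_hdd, d_hdd, bidirectional=True, batch_first=True)

    def forward(self, sents: torch.Tensor):
        # sents: (num_statements, max_words, d_emb), one padded dialogue.
        h, _ = self.word_gru(sents)                    # word memory, (S, W, 2*d_hdd)
        score = self.att_v(torch.tanh(self.att_w(h)))  # Equation 5, (S, W, 1)
        alpha = torch.softmax(score, dim=1)            # attention over words
        u_free = (alpha * h).sum(dim=1)                # context-free vectors, Equation 6
        u, _ = self.sent_gru(u_free.unsqueeze(0))      # contextual vectors, Equation 7
        return h, u.squeeze(0)                         # word memory, sentence memory

# Usage: the word vectors feed the word memory and the statement vectors feed
# the sentence memory of FIG. 3.
enc = HierarchicalEncoder(d_emb=300, d_hdd=128)
h, u = enc(torch.randn(4, 12, 300))  # 4 statements of 12 (padded) words
```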
3.2 Coarse-grained decoder
The coarse-grained decoder learns to compute $p(a|x)$ and generates the state architecture from the dialogue context. We use a GRU to decode the state architecture. At the $t$-th time step of coarse-grained decoding, the hidden vector of the decoder is computed as

$$d_t^{a} = f_a^{GRU}\big(d_{t-1}^{a}, e(a_{t-1})\big)$$

where $e(a_{t-1})$ is the embedding of the previously predicted word and $f_a^{GRU}$ is the coarse-grained decoding GRU function. The hidden state of the first time step of the decoder is initialized from the last hidden state of $f_u^{GRU}$ passed through a linear layer. Furthermore, we use a decoder-to-encoder attention mechanism to learn soft alignments. At the current time step $t$ of the decoder, we use the $i$-th statement-level representation $u_i$ to compute an attention score

$$s_t^{i} = \frac{\exp\big((d_t^{a})^{\top} W_a u_i\big)}{Z_t} \qquad (8)$$

where $Z_t = \sum_{i'} \exp\big((d_t^{a})^{\top} W_a u_{i'}\big)$ is the normalization term, and the weighted attention vector

$$c_t^{a} = \sum_{i} s_t^{i}\, u_i \qquad (9)$$

Then we compute $p(a_t \mid a_{<t}, x)$ as

$$p(a_t \mid a_{<t}, x) = \mathrm{softmax}\big(W_o\,[d_t^{a}, c_t^{a}] + b_o\big) \qquad (10)$$

where $W_a$, $W_o$ and $b_o$ are trainable parameters and $a_{<t} = [a_1, \ldots, a_{t-1}]$. Decoding continues until the end marker "<EOB>" is emitted.
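A single decoding step implementing Equations 8-10 could be sketched as follows (illustrative PyTorch with batch size 1; module and variable names are our own assumptions):

```python
import torch
import torch.nn as nn

class CoarseDecoderStep(nn.Module):
    """One step of the coarse-grained decoder (Equations 8-10), a sketch."""
    def __init__(self, d_emb: int, d_hdd: int, vocab: int):
        super().__init__()
        self.gru = nn.GRUCell(d_emb, 2 * d_hdd)                   # f_a
        self.w_att = nn.Linear(2 * d_hdd, 2 * d_hdd, bias=False)  # W_a
        self.out = nn.Linear(4 * d_hdd, vocab)                    # W_o, b_o

    def forward(self, prev_emb, d_prev, u):
        # prev_emb: (1, d_emb) embedding e(a_{t-1}); d_prev: (1, 2*d_hdd)
        # previous hidden state; u: (S, 2*d_hdd) sentence memory.
        d_t = self.gru(prev_emb, d_prev)                  # hidden d_t^a
        logits = u @ self.w_att(d_t).squeeze(0)           # (S,) raw scores
        s_t = torch.softmax(logits, dim=0)                # Equation 8
        c_t = s_t @ u                                     # Equation 9
        p = torch.log_softmax(self.out(torch.cat([d_t.squeeze(0), c_t])), -1)
        return p, d_t                                     # Equation 10

# In use, the step is applied repeatedly, feeding back the embedding of the
# argmax token, until the end marker "<EOB>" is produced.
```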
3.3 Fine state generation
The structured state is predicted conditioned on the dialogue context $x$ and the generated state architecture $a$. The model uses an encoder-decoder architecture with a two-stage attention mechanism to generate the final state.
Coarse-grained encoder: as shown in FIG. 3, a bidirectional GRU function maps the architecture state into a series of architecture vectors $g = \{g_1, g_2, \ldots, g_{|a|}\}$ (computed as in Equation 7), where $g_t$ is the $t$-th encoded vector.
Fine state decoder: the final decoder is based on two GRUs with a two-stage attention mechanism, as shown in FIG. 3. At the $t$-th time step of the decoder, the hidden vector of the bottom GRU is computed as

$$d_t^{y} = f_{y_1}^{GRU}\big(d_{t-1}^{y}, \psi_t\big) \qquad (11)$$

where the input $\psi_t$ is

$$\psi_t = \begin{cases} g_k, & \text{if } y_{t-1} \text{ equals the } k\text{-th architecture token } a_k \\ e(y_{t-1}), & \text{otherwise} \end{cases} \qquad (12)$$

and $e(y_{t-1})$ is the embedding of the previously predicted word: when the previous output $y_{t-1}$ appears in the generated architecture state, we use the corresponding encoded architecture vector as the input of the decoder. Using the hidden state $d_t^{y}$ and the statement-level representations $u_i$, we compute the first-stage attention scores and the weighted attention vector $c_t^{u}$ in the same way as Equations 8 and 9. The output of the bottom GRU at time step $t$ is

$$\tilde{d}_t = \tanh\big(W_{y_1} [d_t^{y}, c_t^{u}] + b_{y_1}\big) \qquad (13)$$

where $W_{y_1}$, $b_{y_1}$ are parameters.
The input of the top GRU at time step $t$ is $\tilde{d}_t$ and its output is $d_t^{z}$. The second-stage attention scores are computed with the word-level representations $h$:

$$\alpha_t^{i,j} = \frac{\exp\big((d_t^{z})^{\top} W_w h_i^{j}\big)}{Z_t'} \qquad (14)$$

where $Z_t'$ is the normalization term and $h_i^{j}$ is the $j$-th token vector of the $i$-th utterance. We then compute the generation probability $p_g(y_t \mid y_{<t}, x, a)$ as

$$p_g(y_t \mid y_{<t}, x, a) = \mathrm{softmax}\big(W_y [d_t^{z}, c_t^{w}] + b_y\big) \qquad (15)$$

where $c_t^{w} = \sum_{i,j} \alpha_t^{i,j} h_i^{j}$ and $W_y$, $b_y$ are parameters. The final output distribution is a weighted sum of two different distributions:

$$p(y_t \mid y_{<t}, x, a) = p_t^{gen}\, p_g(y_t \mid y_{<t}, x, a) + \big(1 - p_t^{gen}\big)\, p_c(y_t \mid y_{<t}, x, a) \qquad (16)$$

where the copy distribution $p_c$ is formed from the decoder's second-stage attention scores $\alpha_t^{i,j}$, and the scalar $p_t^{gen}$ is computed as

$$p_t^{gen} = \sigma\big(W_{gen} [d_t^{z}, c_t^{w}, \psi_t] + b_{gen}\big)$$

where $W_{gen}$, $b_{gen}$ are parameters and $\sigma$ is the sigmoid function.
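The mixture of Equation 16 could be realized as in the following sketch (illustrative; building the copy distribution $p_c$ by scattering attention mass onto vocabulary ids is a common implementation choice assumed here, not prescribed by the invention):

```python
import torch

def mix_generate_and_copy(p_g: torch.Tensor,
                          copy_scores: torch.Tensor,
                          word_ids: torch.LongTensor,
                          p_gen: torch.Tensor) -> torch.Tensor:
    """Final output distribution of the fine state decoder (Equation 16).

    p_g:         (V,) generation distribution over the vocabulary (Equation 15)
    copy_scores: (N,) second-stage attention weights over dialogue words, sum 1
    word_ids:    (N,) vocabulary ids of those dialogue words
    p_gen:       scalar generate/copy gate in [0, 1]
    """
    # Scatter the attention mass onto the vocabulary; a word occurring several
    # times in the dialogue accumulates the mass of all its occurrences.
    p_c = torch.zeros_like(p_g).scatter_add(0, word_ids, copy_scores)
    return p_gen * p_g + (1.0 - p_gen) * p_c
```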
3.4 Inference
Plain greedy decoding is not suitable for the structured state generation task, since at test time the generated state must be convertible into domain-slot-value tuples. In the invention, we adopt a masked greedy decoding method at test time. For example, in a structured state the first word must be a domain name or the state end token. We design some manual masking rules to ensure that the finally generated state is parseable.
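One way such masking rules could be implemented is sketched below (the concrete rule set is hypothetical; the invention only states that manual rules are designed to keep the output parseable):

```python
import torch

def masked_greedy_step(logits: torch.Tensor, allowed_ids: list) -> int:
    """Pick the most probable *allowed* token (masked greedy decoding sketch).

    `allowed_ids` would come from hand-written state-machine rules, e.g. at the
    first step only domain names or the end token "<EOB>" are allowed, so the
    decoded sequence always parses into domain-slot-value tuples.
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0                  # leave allowed tokens untouched
    return int(torch.argmax(logits + mask))  # all others are suppressed
```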
4. Learning objective
In the present invention, we use both a maximum-likelihood cross-entropy loss and a reinforcement loss; previous work used only the cross-entropy loss. Since we reformulate the dialogue state as a sequence, the BLEU score can be used to evaluate the predicted dialogue state. However, there is a gap between the cross-entropy loss and the target BLEU metric. Therefore, we treat the BLEU score as a reinforcement reward and optimize it directly.
Cross-entropy loss: for sequence generation tasks, the most widely used method of training an encoder-decoder model is to minimize the cross-entropy loss at each decoding step; this method is called the teacher-forcing algorithm. For the coarse-grained state decoder, given the reference architecture $a^* = \{a_1^*, a_2^*, \ldots, a_{|a^*|}^*\}$ of a dialogue, its cross-entropy loss is

$$\mathcal{L}_{CE}^{a} = -\sum_{t=1}^{|a^*|} \log p\big(a_t^* \mid a_{<t}^*, x\big) \qquad (17)$$

where $p(a_t^* \mid a_{<t}^*, x)$ is the prediction probability of $a_t^*$. The cross-entropy loss $\mathcal{L}_{CE}^{y}$ of the fine state decoder is defined analogously (Equation 18).
Reinforcement loss: minimizing the cross-entropy does not guarantee an increase of the evaluation metric BLEU. For generation tasks there is also an exposure-bias problem: in training with the cross-entropy loss, the previous tokens are taken from the reference, but when predicting a sequence the input tokens are sampled from the model rather than taken from the reference, so errors may accumulate along the prediction. The BLEU score accurately indicates the degree of error of a predicted sequence: the less overlap between our generated state and the ground truth, the lower the BLEU score. Furthermore, the BLEU score is not differentiable, so we cannot optimize it by gradient descent. To bridge this gap between training and inference, we treat the BLEU score as a reward and optimize it with a Reinforcement Learning (RL) method.
From the RL point of view of architecture generation, we regard the generated architecture state (the coarse-grained state) as a sequence of actions. For architecture sequence decoding, at each decoding time step a sequence $\hat{a} = \{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_{|\hat{a}|}\}$ is sampled from $p(a \mid x)$. The reward function is defined as

$$r_t^{a} = \begin{cases} \mathrm{BLEU}(\hat{a}, a), & t = |\hat{a}| \\ 0, & \text{otherwise} \end{cases} \qquad (19)$$

where $\mathrm{BLEU}(\cdot,\cdot)$ is the BLEU function and $a$ is the ground truth. After coarse-grained decoding has finished, we compute the discounted cumulative reward of each decoding action as

$$R_t^{a} = \sum_{i=t}^{|\hat{a}|} \lambda^{\,i-t}\, r_i^{a} \qquad (20)$$

where $\lambda$ is the discount rate. To stabilize the RL process, a baseline value is estimated as

$$b_t^{a} = W_{bse}^{a}\, d_t^{a} + b_{bse}^{a} \qquad (21)$$

where $W_{bse}^{a}$, $b_{bse}^{a}$ are parameters and $d_t^{a}$ is the $t$-th hidden state of the coarse-grained decoding GRU in the coarse-grained decoder described above. According to the Policy Gradient theorem, the RL loss is

$$\mathcal{L}_{RL}^{a} = -\sum_{t=1}^{|\hat{a}|} \big(R_t^{a} - b_t^{a}\big) \log p\big(\hat{a}_t \mid \hat{a}_{<t}, x\big) \qquad (22)$$

Likewise, we define the RL loss of the fine state as

$$\mathcal{L}_{RL}^{y} = -\sum_{t=1}^{|\hat{y}|} \big(R_t^{y} - b_t^{y}\big) \log p\big(\hat{y}_t \mid \hat{y}_{<t}, x, a\big) \qquad (23)$$

where $\hat{y}_t$ is the $t$-th word of the predicted state $\hat{y}$, $R^{y}$ is the cumulative reward defined in the same way as $R^{a}$, and $b^{y}$ is the baseline function of the state decoder with parameters $W_{bse}^{y}$, $b_{bse}^{y}$.
Finally, we combine Equations 17, 18, 22 and 23 to obtain the final loss:

$$\mathcal{L} = \beta_1 \mathcal{L}_{CE}^{a} + \beta_2 \mathcal{L}_{CE}^{y} + \beta_3 \big(\mathcal{L}_{RL}^{a} + \mathcal{L}_{RL}^{y}\big) \qquad (24)$$

where $\beta_1, \beta_2, \beta_3$ are hyper-parameters.
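Putting Equations 17-24 together, the training loss could be assembled as in the following sketch (illustrative; the baseline networks are assumed to be fitted to the returns with a separate regression loss, which the text does not detail):

```python
import torch

def credit_loss(ce_a, ce_y, logp_a, R_a, b_a, logp_y, R_y, b_y,
                beta=(0.25, 0.25, 0.25)):
    """Final loss of Equation 24, a sketch.

    ce_a, ce_y : cross-entropy losses of the two decoders (Equations 17-18)
    logp_*     : (T,) log-probabilities of the sampled tokens
    R_*        : (T,) discounted cumulative BLEU rewards (Equation 20)
    b_*        : (T,) baseline values from the decoder hidden states (Equation 21)
    """
    # REINFORCE with baseline (Equations 22-23); the advantage is a constant
    # with respect to the policy parameters, hence the detach.
    rl_a = -((R_a - b_a).detach() * logp_a).sum()
    rl_y = -((R_y - b_y).detach() * logp_y).sum()
    b1, b2, b3 = beta
    return b1 * ce_a + b2 * ce_y + b3 * (rl_a + rl_y)
```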
5. Experiments
5.1 Datasets
MultiWOZ 2.0: MultiWOZ 2.0 is the largest task-oriented dialogue dataset for multi-domain dialogue state tracking, containing 8,438 multi-turn dialogues spanning 7 dialogue domains. For the dialogue state tracking task, only five domains (restaurant, hotel, attraction, taxi, train) appear in the validation and test sets; the hospital and bus domains are present only in the training set.
MultiWOZ 2.1: MultiWOZ 2.1 fixes the state annotations and statements of MultiWOZ 2.0. In MultiWOZ 2.0 there are five common error types in the dialogue state annotations: delayed markups, multi-annotations, mis-annotations, typos and forgotten values; these error types are described in detail in the MultiWOZ 2.1 release. The number of dialogues in MultiWOZ 2.1 is the same as in MultiWOZ 2.0.
5.2 Evaluation metrics
Joint goal accuracy: this is the standard metric of the dialogue state tracking task. A dialogue turn counts as jointly correct only if the values of all domain-slot pairs are predicted correctly. In this work, we follow the TRADE implementation of this metric (a small illustration follows this list).
BLEU score: because we redefine the dialogue state as a sequence, the BLEU score can also measure the performance of the model.
ITC: ITC represents the inference time complexity of the model, first proposed in (Ren et al, 2019).
Model size: in this work, we also measure the size of the tracking models. We note that some works use pre-trained models as part of the tracking model to improve performance. In practical settings, the trade-off between parameter size and performance is critical.
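A small illustration of the joint goal accuracy metric (our own sketch, not the TRADE implementation):

```python
def joint_goal_accuracy(preds, golds):
    """Fraction of turns whose complete state is exactly correct."""
    hits = sum(p == g for p, g in zip(preds, golds))
    return hits / len(golds)

# The second turn misses one slot value, so the joint accuracy is 0.5 even
# though most individual slot values are right.
gold = [{"hotel-area": "centre"},
        {"hotel-area": "centre", "train-day": "monday"}]
pred = [{"hotel-area": "centre"},
        {"hotel-area": "centre", "train-day": "sunday"}]
assert joint_goal_accuracy(pred, gold) == 0.5
```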
5.3 Training details
Similar to TRADE, we initialize all embeddings with a concatenation of GloVe embeddings and character embeddings. We divide the training process into two phases: a Supervised Learning (SL) pre-training phase and a Reinforcement Learning (RL) fine-tuning phase. In the SL phase, we train the model using only the cross-entropy loss, with an Adam optimizer, a learning rate of 1e-4, a batch size of 16 and a dropout rate of 0.2. In the RL phase, we change the learning rate to 1e-5 and the dropout rate to 0. The hyper-parameters of the final loss (Equation 24) are β1 = 0.25, β2 = 0.25, β3 = 0.25, and the RL discount rate λ is 1.
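For reference, the stated hyper-parameters can be collected as follows (the dictionary layout and key names are our own):

```python
# Training setup of Section 5.3.
config = {
    "embedding": "GloVe + character (concatenated)",
    "sl_phase": {"optimizer": "Adam", "lr": 1e-4, "batch_size": 16, "dropout": 0.2},
    "rl_phase": {"optimizer": "Adam", "lr": 1e-5, "dropout": 0.0, "discount": 1.0},
    "loss_weights": {"beta1": 0.25, "beta2": 0.25, "beta3": 0.25},
}
```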
[Table 1 is rendered as an image in the original document.]
Table 1: Results of the baseline models and our proposed CREDIT on the MultiWOZ 2.0 and MultiWOZ 2.1 datasets. +BERT denotes that the tracking model uses pre-trained BERT to encode the utterances. In the ITC column, M is the number of slots and N is the number of values. * denotes that the model has not released its source code, so the model size is unknown. Joint Acc. denotes joint goal accuracy.
5.4 Results
Joint goal accuracy: as shown in Table 1, our proposed CREDIT achieves the best performance among the models without BERT, where CREDIT-SL uses only the cross-entropy loss and CREDIT-RL is fine-tuned from the pre-trained CREDIT-SL with the RL method introduced in Section 4. CREDIT-RL further improves the performance on both the MultiWOZ 2.0 and MultiWOZ 2.1 datasets. COMER is the method closest to CREDIT, as described in Section 2. The benefit of CREDIT compared with COMER comes from the structured state representation: it not only contains the structured information of the dialogue state but is also a sequence that is easy to generate with a conventional encoder-decoder model. In COMER, the state is a tree and the generated values of different slots are independent of each other, which can be a serious problem in the state tracking task. In the MultiWOZ dataset in particular, the slots origin and destination have many identical values; the tree state representation cannot capture the relation between such values, while the structured sequence state can.
ITC: as the number of slots increases, inference time becomes an important factor affecting the response speed of the dialogue system. Our proposed CREDIT directly generates all predicted domain-slot-value tuples without visiting all slots defined in the ontology; therefore, the inference time complexity of CREDIT is O(1). In contrast, although the ITC of SOM-DST is also O(1), its input needs to include all slot-value pairs of the previous dialogue state; in other words, this approach reduces the inference time complexity by increasing the memory storage complexity. Our proposed CREDIT increases neither the inference time complexity nor the memory storage complexity as the number of slots grows.
Influence analysis: FIG. 5a shows the impact of dialogue length on the joint accuracies, and FIG. 5b shows the impact of ground-truth state length on the joint accuracies. In FIGS. 5a and 5b we examine the effect of the dialogue length and the ground-truth length, where the ground-truth length is the number of domain-slot-value tuples in the dialogue state and the dialogue length is the number of turns of the dialogue. The state curve and the architecture curve represent the joint state accuracy and the joint architecture accuracy, respectively. It can be seen that as the dialogue length or state length increases, both joint accuracies decrease rapidly, and the two accuracies follow exactly the same trend.
[Table 2 is rendered as an image in the original document.]
Table 2: Results of the ablation experiments on the MultiWOZ 2.1 dataset; (-) indicates deletion or modification of the corresponding component.
5.5 Ablation experiments
Table 2 shows the results of the ablation study. We examine the impact of three factors: the coarse-grained decoder, the copy mechanism used in the fine state decoder, and the hierarchical statement encoder.
Effect of the coarse-grained decoder: in this ablation, we remove the coarse-grained decoder and the coarse-grained encoder from CREDIT. When generating the final state sequence in the fine state decoder, we use only the embedding of the previously predicted token $y_{t-1}$ as input, rather than the input computed by Equation 12, so no architecture information can guide the state generation process. As shown in Table 2, without the coarse-grained decoder the performance degrades significantly. This shows that the coarse-to-fine generation mechanism simplifies the generation process of the structured state, and that the architecture information facilitates the generation of the fine state.
Effect of the copy mechanism: in this ablation, we directly use the generation probability $p_g$ computed by Equation 15 as the final output distribution of the fine state decoder. In the dialogue state tracking task, the copy mechanism is an important method to alleviate the unseen-value problem; in addition, it directs the model to put emphasis on the dialogue context information. As shown in Table 2, the copy mechanism has a large impact on the final performance.
Effect of the hierarchical encoder: to verify the effect of the hierarchical encoder, we concatenate all dialogue statements into one sequence and encode it with a single bidirectional GRU, so that the fine state decoder has only one decoder-to-encoder attention stage. In this case we observe the largest performance degradation, which suggests that the hierarchical encoding scheme, combined with the two-stage attention, provides more statement information.
6. Conclusion
In this work, we redefine the dialogue state as a structured state sequence, treat the DST task as a dialogue-level semantic parsing task, and simplify the state generation process with coarse-to-fine generation. With the carefully designed CREDIT model, we obtain encouraging results on the multi-domain DST task. Furthermore, we are the first to fine-tune a generative DST model with reinforcement learning methods and thereby improve its performance.
It should be noted that for simplicity of explanation, the foregoing method embodiments are described as a series of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated ordering of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in this specification are presently preferred and that no acts or modules are required by the invention. In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In some embodiments, the present invention provides a non-transitory computer-readable storage medium, in which one or more programs including executable instructions are stored, where the executable instructions can be read and executed by an electronic device (including but not limited to a computer, a server, or a network device, etc.) to perform any one of the above-described dialog state tracking method and/or man-machine dialog method of the present invention.
In some embodiments, the present invention further provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the above-described dialog state tracking methods and/or human-machine dialog methods.
In some embodiments, an embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a dialog state tracking method and/or a human-machine dialog method.
In some embodiments, an embodiment of the present invention further provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements a dialog state tracking method and/or a man-machine dialog method.
Fig. 6 is a schematic diagram of a hardware structure of an electronic device for performing a dialog state tracking method according to another embodiment of the present application, where as shown in fig. 6, the electronic device includes: one or more processors 610 and a memory 620, with one processor 610 being an example in fig. 6. The apparatus for performing the dialog state tracking method may further include: an input device 630 and an output device 640.
The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or other means, such as the bus connection in fig. 6.
The memory 620, which is a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as program instructions/modules corresponding to the dialog state tracking method in the embodiments of the present application. The processor 610 executes various functional applications of the server and data processing by running non-volatile software programs, instructions and modules stored in the memory 620, so as to implement the dialog state tracking method of the above method embodiment.
The memory 620 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the dialog state tracking device, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 620 may optionally include memory located remotely from the processor 610, which may be connected to the dialog state tracking device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Input device 630 may receive entered numeric or character information and generate signals related to user settings and function controls of the dialog state tracking device. The output device 640 may include a display device such as a display screen.
The one or more modules are stored in the memory 620 and, when executed by the one or more processors 610, perform the dialog state tracking method of any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
The electronic device of the embodiments of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones (e.g., iphones), multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as ipads.
(3) Portable entertainment devices such devices may display and play multimedia content. Such devices include audio and video players (e.g., ipods), handheld game consoles, electronic books, and smart toys and portable car navigation devices.
(4) Servers, which have an architecture similar to that of a general-purpose computer but, because they must provide highly reliable services, have higher requirements on processing capacity, stability, reliability, security, scalability, manageability, and the like.
(5) Other electronic devices with data interaction functions.
The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and certainly can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or in the part that contributes to the related art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method of each embodiment or of certain parts of an embodiment.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (12)

1. A dialog state tracking system, comprising:
a hierarchical statement encoder configured to hierarchically encode the dialog history statements;
a coarse-grained decoder configured to decode according to a hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialog state, the coarse-grained dialog state being composed of the domain-slot pairs mentioned in the dialog history statements, so that no slot value needs to be copied from the dialog;
a coarse-grained encoder configured to perform coarse-grained encoding on the coarse-grained dialog state; and
a state decoder configured to determine a serialized dialog state based on the hierarchical encoding result and the coarse-grained encoding result, wherein specific slot values can be copied by the state decoder directly from the dialog history statements, and the state decoder employs two levels of decoder-to-encoder attention: sentence-level attention over the sentence vectors and word-level attention over the word vectors, wherein the sentence-level attention identifies the dialog domains and the corresponding slots mentioned in the dialog history statements, and the word-level attention extracts the specific values of the relevant slots from the dialog history statements.
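To make the data flow of claim 1 concrete, the following is a minimal PyTorch-style sketch of the four modules in sequence. The class and variable names, hidden sizes, teacher-forced interface, and additive fusion of the attention contexts are illustrative assumptions, not details fixed by the patent:

```python
import torch
import torch.nn as nn

def attend(query, keys):
    # scaled dot-product attention; query: (1, T, d), keys: (1, S, d)
    w = torch.softmax(query @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5, dim=-1)
    return w @ keys  # (1, T, d) context vectors

class CoarseToFineDST(nn.Module):
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # hierarchical statement encoder: word level, then sentence level
        self.word_enc = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.sent_enc = nn.GRU(2 * hidden, hidden, bidirectional=True, batch_first=True)
        # coarse-grained decoder: predicts domain-slot pairs, no slot values
        self.coarse_dec = nn.GRU(hidden, 2 * hidden, batch_first=True)
        # coarse-grained encoder: re-encodes the coarse dialog state
        self.coarse_enc = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        # state decoder: emits the serialized state, slot values included
        self.state_dec = nn.GRU(hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, history, coarse_state, state_prefix):
        # history: (sents, words) token ids; the other two: (1, len) token ids
        word_out, _ = self.word_enc(self.embed(history))   # word vectors
        sent_in = word_out[:, -1:, :].transpose(0, 1)      # (1, sents, 2h)
        sent_out, sent_h = self.sent_enc(sent_in)          # sentence vectors
        h0 = sent_h.transpose(0, 1).reshape(1, 1, -1)      # fold fwd/bwd states
        # hidden states of the coarse decoder; the projection onto
        # domain-slot tokens is omitted for brevity
        coarse_hidden, _ = self.coarse_dec(self.embed(coarse_state), h0)
        coarse_out, _ = self.coarse_enc(self.embed(coarse_state))
        dec_out, _ = self.state_dec(self.embed(state_prefix), h0)
        # two-level decoder-to-encoder attention plus coarse-grained context
        sent_ctx = attend(dec_out, sent_out)
        word_ctx = attend(dec_out, word_out.reshape(1, -1, word_out.shape[-1]))
        coarse_ctx = attend(dec_out, coarse_out)
        return self.out(dec_out + sent_ctx + word_ctx + coarse_ctx)
```

A full system would greedily decode the coarse state from coarse_hidden before re-encoding it, and would fold the word-level attention weights into a copy distribution so that slot values can be taken verbatim from the history; both steps are elided to keep the sketch small.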
2. The system of claim 1, wherein the hierarchical statement encoder comprises a sequentially connected word-level encoder and sentence-level encoder, wherein,
the word-level encoder is configured to perform word-level encoding on the statements in the dialog history;
the sentence-level encoder is configured to perform sentence-level encoding on the word-level encoding results output by the word-level encoder, and to output the sentence-level encoding results to the coarse-grained decoder and the state decoder.
3. The system of claim 2, wherein,
the word-level encoder is configured with a first memory for storing the attention mechanism of a word-level decoding-encoding model, the first memory being connected to the state decoder;
the sentence-level encoder is configured with a second memory for storing the attention mechanism of a sentence-level decoding-encoding model, the second memory being connected to the coarse-grained encoder and the state decoder.
4. The system of claim 2 or 3, wherein the word-level encoder, the sentence-level encoder, and the coarse-grained encoder comprise bidirectional GRU units, and the coarse-grained decoder and the state decoder comprise unidirectional GRU units.
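As a concrete reading of claim 4, the sketch below pairs a bidirectional GRU encoder with a unidirectional GRU decoder whose hidden width matches the concatenated forward and backward encoder states; the sizes and the state-folding step are assumptions made for illustration:

```python
import torch
import torch.nn as nn

hidden = 256
# encoders per claim 4: bidirectional GRU units
encoder = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
# decoders per claim 4: unidirectional GRU units, here sized to the
# concatenated forward+backward encoder output (an assumption)
decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)

x = torch.randn(1, 10, hidden)       # 10 embedded tokens, batch size 1
enc_out, enc_h = encoder(x)          # enc_out: (1, 10, 512), fwd||bwd
# fold the two directional final states into one decoder initial state
h0 = enc_h.permute(1, 0, 2).reshape(1, 1, 2 * hidden)
dec_out, _ = decoder(torch.randn(1, 5, hidden), h0)
print(enc_out.shape, dec_out.shape)  # (1, 10, 512) and (1, 5, 512)
```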
5. A human-machine dialog device comprising a dialog state tracking system according to any of claims 1 to 4.
6. A dialog state tracking method applied to the dialog state tracking system of claim 1, the method comprising:
the hierarchical statement encoder hierarchically encoding the dialog history statements;
the coarse-grained decoder decoding according to the hierarchical encoding result of the hierarchical statement encoder to obtain a coarse-grained dialog state, the coarse-grained dialog state being composed of the domain-slot pairs mentioned in the dialog history statements, so that no slot value needs to be copied from the dialog;
the coarse-grained encoder performing coarse-grained encoding on the coarse-grained dialog state; and
the state decoder determining a serialized dialog state according to the hierarchical encoding result and the coarse-grained encoding result, wherein specific slot values can be copied by the state decoder directly from the dialog history statements, and the state decoder employs two levels of decoder-to-encoder attention: sentence-level attention over the sentence vectors and word-level attention over the word vectors, wherein the sentence-level attention identifies the dialog domains and the corresponding slots mentioned in the dialog history statements, and the word-level attention extracts the specific values of the relevant slots from the dialog history statements.
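One plausible realization of the two attention levels recited in claims 1 and 6, using plain dot-product scoring: the sentence-level weights select the relevant utterances, and multiplying them with the per-sentence word-level weights yields a distribution over every history token, which can double as the copy distribution for slot values. The function name and this product factorization are assumptions of the sketch:

```python
import torch

def two_level_attention(dec_h, sent_vecs, word_vecs):
    # dec_h: (d,) current state-decoder hidden state
    # sent_vecs: (S, d) one vector per history sentence
    # word_vecs: (S, W, d) word vectors of each sentence
    sent_w = torch.softmax(sent_vecs @ dec_h, dim=0)      # which sentences matter
    word_w = torch.softmax(word_vecs @ dec_h, dim=1)      # which words, per sentence
    copy_dist = (sent_w.unsqueeze(1) * word_w).flatten()  # over all history tokens
    sent_ctx = sent_w @ sent_vecs                         # sentence-level context
    word_ctx = copy_dist @ word_vecs.reshape(-1, word_vecs.shape[-1])
    return sent_ctx, word_ctx, copy_dist

d, S, W = 8, 3, 5
sent_ctx, word_ctx, copy_dist = two_level_attention(
    torch.randn(d), torch.randn(S, d), torch.randn(S, W, d))
print(copy_dist.sum())  # ~1.0: a proper distribution over the 15 history tokens
```

Because each factor is normalized, copy_dist sums to one over all history tokens, so it could be mixed directly with a vocabulary distribution in a pointer-generator fashion.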
7. The method of claim 6, wherein the hierarchical statement encoder comprises a sequentially connected word-level encoder and sentence-level encoder, and
the hierarchical statement encoder hierarchically encoding the dialog history statements comprises:
the word-level encoder performing word-level encoding on the statements in the dialog history; and
the sentence-level encoder performing sentence-level encoding on the word-level encoding results output by the word-level encoder, and outputting the sentence-level encoding results to the coarse-grained decoder and the state decoder.
8. The method of claim 7, wherein the word-level encoder is configured with a first memory for storing the attention mechanism of a word-level decoding-encoding model, the first memory being connected to the state decoder; and the sentence-level encoder is configured with a second memory for storing the attention mechanism of a sentence-level decoding-encoding model, the second memory being connected to the coarse-grained encoder and the state decoder;
wherein the coarse-grained decoder decoding according to the hierarchical encoding result of the hierarchical statement encoder to obtain the coarse-grained dialog state comprises:
the coarse-grained decoder decoding according to the hidden state of the sentence-level encoder and the attention mechanism of the sentence-level decoding-encoding model to obtain the coarse-grained dialog state;
and wherein the state decoder determining the serialized dialog state according to the hierarchical encoding result and the coarse-grained encoding result comprises:
the state decoder determining the serialized dialog state according to the attention mechanism of the word-level decoding-encoding model, the attention mechanism of the sentence-level decoding-encoding model, the hidden state of the sentence-level encoder, and the coarse-grained encoding result.
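For the combination step recited in claim 8, a single state-decoder step might fuse the four listed inputs as below; the concatenate-then-project fusion and the dimensions are assumptions made only for this sketch:

```python
import torch
import torch.nn as nn

d, vocab = 8, 100
fuse = nn.Linear(4 * d, vocab)

def decode_step(dec_h, word_ctx, sent_ctx, coarse_ctx):
    # dec_h: state-decoder hidden state (initialized from the sentence-level
    # encoder's hidden state); the other three are context vectors from the
    # word-level attention, sentence-level attention, and coarse encoding
    return torch.softmax(fuse(torch.cat([dec_h, word_ctx, sent_ctx, coarse_ctx])), dim=-1)

probs = decode_step(*(torch.randn(d) for _ in range(4)))
print(probs.shape)  # torch.Size([100])
```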
9. The method of claim 7 or 8, wherein the word-level encoder, the sentence-level encoder, and the coarse-grained encoder comprise bidirectional GRU units, and the coarse-grained decoder and the state decoder comprise unidirectional GRU units.
10. A human-machine dialog method comprising the steps of the dialog state tracking method of any of claims 6 to 9.
11. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 6-10.
12. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 6 to 10.
CN202010165025.2A 2020-03-11 2020-03-11 Conversation state tracking system and method, man-machine conversation device and method Active CN111400468B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010165025.2A CN111400468B (en) 2020-03-11 2020-03-11 Conversation state tracking system and method, man-machine conversation device and method


Publications (2)

Publication Number   Publication Date
CN111400468A (en)    2020-07-10
CN111400468B (en)    2022-07-15

Family

ID=71436209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010165025.2A Active CN111400468B (en) 2020-03-11 2020-03-11 Conversation state tracking system and method, man-machine conversation device and method

Country Status (1)

Country Link
CN (1) CN111400468B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199482B (en) * 2020-09-30 2023-07-21 平安科技(深圳)有限公司 Dialogue generation method, device, equipment and readable storage medium
CN113065336B (en) * 2021-05-06 2022-11-25 清华大学深圳国际研究生院 Text automatic generation method and device based on deep learning and content planning
CN113705652A (en) * 2021-08-23 2021-11-26 西安交通大学 Task type conversation state tracking system and method based on pointer generation network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885756B (en) * 2016-09-30 2020-05-08 华为技术有限公司 Deep learning-based dialogue method, device and equipment
CN108710704B (en) * 2018-05-28 2021-04-02 出门问问信息科技有限公司 Method and device for determining conversation state, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111400468A (en) 2020-07-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

GR01 Patent grant