CN114281954A - Multi-round dialog reply generation system and method based on relational graph attention network - Google Patents

Multi-round dialog reply generation system and method based on relational graph attention network

Info

Publication number
CN114281954A
CN114281954A
Authority
CN
China
Prior art keywords
representation
information
turn
utterance
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111044215.XA
Other languages
Chinese (zh)
Inventor
林菲
钱朝辉
张聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202111044215.XA priority Critical patent/CN114281954A/en
Publication of CN114281954A publication Critical patent/CN114281954A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention belongs to the field of natural language generation in artificial intelligence and discloses a multi-round dialog reply generation system and method based on a relational graph attention network. The method comprises the following steps: acquiring and preprocessing the multi-turn dialog input, obtaining the semantic representation of each turn of utterance, and encoding these utterance representations to obtain the semantic representation of the dialog context; then capturing the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, into which a relational position encoding is introduced to account for the sequential information among the utterances, thereby obtaining the high-level semantic representation of the graph coding layer; and finally taking the dialog-context semantic representation and the high-level semantic representation of the relational graph attention encoding as input and decoding with a GRU model to generate the final dialog reply output. The invention significantly improves the quality of multi-turn dialog reply generation, making the generated replies more coherent and meaningful.

Description

Multi-round dialog reply generation system and method based on relational graph attention network
Technical Field
The invention belongs to the field of natural language generation in artificial intelligence, and particularly relates to a multi-round dialog reply generation system and method based on a relational graph attention network.
Background
With the rapid growth of the Internet and social media, massive user dialogue corpora have been generated, providing the conditions for data-driven dialogue systems. The great research and commercial value of intelligent dialogue systems is attracting increasing attention from both academia and industry. Dialogue systems can currently be divided into task-driven, limited-domain dialogue systems and open-domain dialogue systems without a specific task; compared with the former, the latter offer better practicability, extensibility and domain adaptability, so open-domain dialogue systems have gradually become a focus of researchers' attention.
Current methods can be divided, by how the system is implemented, into retrieval models and generative models. A retrieval model uses a selection algorithm to choose a suitable reply from the dialogue corpus; although such replies are grammatically correct and factually grounded, they suffer from problems such as monotonous reply sentences and limited topics. A generative model, by contrast, uses natural language processing techniques to learn and understand the context information input by the user and then generates the corresponding reply word by word. Generative models can in turn be divided into single-turn and multi-turn dialogue generation models according to whether historical dialogue information is considered. Compared with single-turn dialogue generation, multi-turn dialogue generation requires the system to understand complex context information and is therefore more challenging. However, research on open-domain multi-turn dialogue generation still faces many difficulties, such as generic replies, lack of background knowledge and lack of consistency. Improving open-domain multi-turn dialogue generation systems therefore has great research value.
Recent research in this field has mainly been built on sequence-to-sequence frameworks and has focused on how to effectively model the context information. However, previous work rarely considers the utterance dependencies between interlocutors and their temporal information. To solve this problem, the key is how to model the utterance dependencies between interlocutors, while also attending to the temporal information of the interlocutors' utterances. The invention adopts a relational graph attention network to model the utterance dependencies between interlocutors and fully mines the information carried by the different relation types between the interlocutors' utterances; at the same time, a relational position encoding is introduced to capture the sequential information between the interlocutors' utterances, so that the generated replies are more coherent, natural and specific.
Disclosure of Invention
The present invention provides a multi-turn dialog reply generation system and method based on a relational graph attention network to solve the above technical problems.
In order to solve the above technical problems, the specific technical scheme of the multi-round dialog reply generation system and method based on the relational graph attention network is as follows:
A multi-round dialog reply generation system based on a relational graph attention network comprises a sentence input coding layer, a graph coding layer and a decoding layer, wherein the sentence input coding layer comprises a word-level encoder and an utterance-level encoder; the word-level encoder encodes the words in each turn of utterance input to the model, thereby obtaining the semantic representation of that turn of utterance; the utterance-level encoder encodes these utterance representations, thereby obtaining the semantic representation of the entire dialog context; the graph coding layer first captures the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, and then introduces a relational position encoding to account for the sequential information among the utterances; the decoding layer generates a response reply based on the input contextual semantic representation and the high-level semantic representation of the graph coding layer.
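The following is a minimal, non-limiting sketch of how the three layers described above could be composed, written in Python with PyTorch (which the patent does not mandate); the sub-module interfaces and names are illustrative assumptions intended only to show the data flow from the sentence input coding layer through the graph coding layer to the decoder.

```python
# Hypothetical composition of the three layers; the sub-modules themselves are
# assumed to exist and are sketched separately further below.
import torch.nn as nn

class RelationalGraphDialogModel(nn.Module):
    def __init__(self, sentence_encoder, graph_encoder, decoder):
        super().__init__()
        self.sentence_encoder = sentence_encoder  # word-level + utterance-level encoders
        self.graph_encoder = graph_encoder        # relational graph attention layers
        self.decoder = decoder                    # GRU decoder

    def forward(self, utterances, speakers, target_ids=None):
        context = self.sentence_encoder(utterances)           # dialog-context representation
        graph_repr = self.graph_encoder(context, speakers)    # high-level semantic representation
        return self.decoder(context, graph_repr, target_ids)  # reply distribution / training loss
```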
The invention also discloses a multi-round dialog reply generation method based on the relational graph attention network, which comprises the following steps:
Step one: acquiring and preprocessing the multi-turn dialog input, converting the word meaning information in each turn of utterance into corresponding vector representations through a pre-trained BERT model to obtain the semantic representation of each turn of utterance, and encoding these utterance representations through a Bi-GRU model to obtain the semantic representation of the dialog context;
Step two: capturing the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, and introducing a relational position encoding into the graph attention network to account for the sequential information among the utterances, thereby obtaining the high-level semantic representation of the graph coding layer;
Step three: taking the dialog-context semantic representation and the high-level semantic representation of the relational graph attention encoding as input, and decoding with a GRU model to generate the final dialog reply output.
Further, in step one, preprocessing the multi-turn dialog input and obtaining the semantic representation of each turn of utterance comprises: encoding the words in each input turn of utterance by first applying the BPE algorithm to obtain a sequence-tagged representation of each turn of utterance, and then inputting it into the pre-trained BERT language model for fine-tuning, thereby obtaining the semantic representation of each turn of utterance.
Further, step one encodes with a word-level encoder and an utterance-level encoder, specifically as follows:
For a given multi-turn dialog context of length M, U = {u_1, ..., u_M}, the word-level encoder first applies the BPE algorithm to obtain the sequence-tagged representation of each turn of utterance, u_i = {x_{i,1}, x_{i,2}, ..., x_{i,T_i}}, where T_i is the number of tokens in the i-th turn of utterance, and then inputs it into the pre-trained BERT language model for fine-tuning; the word-level encoding process is expressed as
c_i = BERT(x_{i,1}, x_{i,2}, ..., x_{i,T_i}),
thereby obtaining the semantic representation c_i of each turn of utterance.
A Bi-GRU model is adopted as the utterance-level encoder. The Bi-GRU takes the utterance representations c_1, ..., c_M obtained from the upper word-level encoder as input and encodes each turn of utterance; the utterance-level encoding process is expressed as
h_i^fwd = GRU_fwd(c_i, h_{i-1}^fwd),
h_i^bwd = GRU_bwd(c_i, h_{i+1}^bwd),
where h_i^fwd is the i-th hidden state of the forward GRU and h_i^bwd is the i-th hidden state of the backward GRU; the hidden states of the forward and backward GRUs are concatenated to obtain the context-aware semantic representation h_i = [h_i^fwd; h_i^bwd]. High-level feature information between the utterances of the multi-turn dialog is thus captured through this hierarchical structure.
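A minimal sketch of this hierarchical encoder is given below, assuming PyTorch and a Hugging Face BERT model, neither of which the patent prescribes; the module name HierarchicalEncoder, the hidden sizes, and the use of the [CLS] vector as the utterance representation c_i are illustrative assumptions.

```python
# Hierarchical sentence-input encoder sketch: a BERT word-level encoder per
# utterance followed by a bidirectional GRU utterance-level encoder over the
# whole dialog context.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class HierarchicalEncoder(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", gru_hidden=256):
        super().__init__()
        # WordPiece subword tokenization stands in here for the BPE tokenization in the text
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)  # word-level encoder (fine-tuned with the model)
        self.utterance_gru = nn.GRU(self.bert.config.hidden_size, gru_hidden,
                                    bidirectional=True, batch_first=True)

    def forward(self, utterances):
        # utterances: list of M strings, one per dialog turn
        utt_vectors = []
        for u in utterances:
            enc = self.tokenizer(u, return_tensors="pt", truncation=True)
            out = self.bert(**enc)
            utt_vectors.append(out.last_hidden_state[:, 0])  # [CLS] vector as utterance representation c_i
        c = torch.stack(utt_vectors, dim=1)      # (1, M, hidden)
        h, _ = self.utterance_gru(c)             # (1, M, 2 * gru_hidden): [forward; backward] per turn
        return h


# Usage on a toy three-turn dialog:
encoder = HierarchicalEncoder()
context = encoder(["hi, how are you?", "fine, thanks. you?", "pretty good."])
print(context.shape)  # torch.Size([1, 3, 512])
```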
Further, obtaining the high-level semantic representation of the graph coding layer in step two comprises the following steps:
A directed graph G = (V, E, R) is constructed over the M utterances of the multi-turn dialog and defined as follows: each utterance in the multi-turn dialog is defined as a node v_i ∈ V; the relationship dependency information between utterances is defined as an edge (v_i, r, v_j) ∈ E, where r ∈ R is the relation type; and the weight of an edge is defined as α_ijr.
(1) First, the contextual semantic representation h_i output by the context coding layer is taken as the initial vector representation of node v_i;
(2) the information edges r are constructed over the nodes, and their types are divided into the following four categories: (a) self-before edges r1: dependency between the target utterance and earlier utterances of the same speaker; (b) inter-before edges r2: dependency between the target utterance and earlier utterances of the other speakers; (c) self-after edges r3: dependency between the target utterance and later utterances of the same speaker; (d) inter-after edges r4: dependency between the target utterance and later utterances of the other speakers;
(3) relational position encoding (Relational Position Encodings) is used to capture the temporal information between utterances connected by these four edge types; PE_ijr denotes the relational position encoding between the target utterance u_i and its neighboring utterance u_j under relation type r, taking values within [b, a], where b and a are the sliding-window values of the target utterance with respect to the other utterances, and N_i^r denotes the neighborhood of the target utterance u_i under relation type r (a code sketch of the edge construction and position encoding is given after these steps);
(4) the weights of the relational information edges are calculated: α_ijr denotes the edge weight between the target utterance u_i and its neighboring utterance u_j under relation type r, computed by the attention mechanism combined with the relational position encoding, where W_r is the parameterized weight matrix in the attention mechanism, a_r is the parameterized weight vector, ·^T denotes transposition, and LRL is the LeakyReLU activation function;
(5) the vector representation of each node is updated by aggregating over its neighborhoods N_i^r; the graph propagation is carried out layer by layer with trainable parameter weight matrices, where L is the number of graph convolution layers, and the output of the final layer gives the high-level semantic representation of the graph coding layer.
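The sketch below illustrates, in Python, how the four relation-typed edges and their relational position encodings might be built for a dialog; because the original formulas survive only as images, the clipping of the relative distance to the sliding window [b, a] and the default window values are assumptions, not the patent's exact definition.

```python
# Illustrative construction of the four relation-typed edges and their
# relational position encodings for a multi-turn dialog.
def build_relational_edges(speakers, a=4, b=-4):
    """speakers: list of speaker ids, one per utterance, in dialog order."""
    edges = []  # entries of the form (i, j, relation_type, position_encoding)
    for i, si in enumerate(speakers):          # target utterance u_i
        for j, sj in enumerate(speakers):      # neighboring utterance u_j
            if i == j:
                continue
            if j < i:
                rel = "self-before" if sj == si else "inter-before"
            else:
                rel = "self-after" if sj == si else "inter-after"
            pe = max(b, min(a, j - i))         # relative distance clipped to the window [b, a] (assumed form)
            edges.append((i, j, rel, pe))
    return edges


# Example: a four-turn dialog between speakers A and B.
for edge in build_relational_edges(["A", "B", "A", "B"])[:6]:
    print(edge)
```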
Further, step three comprises the following steps:
A GRU model is adopted as the decoder to generate the reply. The initial decoder state s_0 is computed from the concatenation of the last hidden states of the forward and backward GRUs of the utterance-level encoder through the trainable parameters W_e and b_e; s_t is the hidden state of the decoder at time t, computed by the GRU from the previous state, the word embedding e(r_{t-1}) of the word output at time t-1, and the high-level semantic representation output by the L-th layer of the graph coding layer at time t-1.
Finally, according to the high-level semantic information of the graph coding layer, combined with the decoder hidden state s_t at time t, the output at the current time is predicted, where W_o and b_o are trainable parameters and p denotes the probability of generating a word at the current time.
With a given real dialog reply R = [r_1, r_2, ..., r_T] as the training target, a cross-entropy loss function L = -Σ_{t=1}^{T} log p(r_t) is used to train the model parameters.
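A minimal sketch of this training objective follows: cross-entropy over the tokens of the reference reply, teacher-forced through the GRU decoder. The decoder_step interface and the embedding argument are assumptions introduced only for illustration; the patent does not define such functions.

```python
# Cross-entropy training objective over the reference reply r_1..r_T, with
# teacher forcing; decoder_step is an assumed callable returning the next
# hidden state and vocabulary logits.
import torch
import torch.nn.functional as F

def reply_loss(decoder_step, s0, target_ids, embedding, graph_repr):
    """target_ids: (T,) token ids of the reference reply r_1..r_T."""
    s_t, loss = s0, 0.0
    prev = torch.zeros_like(embedding.weight[0])            # e(r_0): start-of-sequence embedding (assumed zero vector)
    for t in range(target_ids.size(0)):
        s_t, logits = decoder_step(s_t, prev, graph_repr)   # hidden state s_t and vocabulary logits
        loss = loss + F.cross_entropy(logits.unsqueeze(0), target_ids[t:t + 1])
        prev = embedding(target_ids[t])                     # teacher forcing: feed the gold word r_t
    return loss / target_ids.size(0)
```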
the multi-round dialog reply generation system and method based on the relation graph attention network have the following advantages that:
1. compared with the traditional single-round dialogue reply generation, the multi-round dialogue reply generation method can capture high-level feature information among multi-round dialogue utterances through a hierarchical structure, so that the generated reply information has higher correlation and more diversity.
2. The invention adopts the pre-training language model BERT to learn a better text characteristic through a deep model, thereby effectively solving the problem of word ambiguity.
3. The method uses a relational graph attention network model to capture the interdependence relation among text sequences by constructing nodes, edges and corresponding topological structures, thereby further extracting potential feature representation.
Drawings
FIG. 1 is a block diagram of a system for generating a multi-turn dialog reply based on a graph attention network according to the present invention;
FIG. 2 is a conceptual diagram of the relational position of the present invention;
FIG. 3 is a diagram of the position encoding process for the four different relation types of the present invention.
Detailed Description
In order to better understand the purpose, structure and function of the present invention, a multi-turn dialog reply generation system and method based on a graph attention network according to the present invention will be described in further detail with reference to the accompanying drawings.
As shown in fig. 1, the method for generating a multi-turn dialog reply based on a relational graph attention network of the present invention includes the following steps:
Step one: acquiring and preprocessing the multi-turn dialog input, converting the word meaning information in each turn of utterance into corresponding vector representations through a pre-trained BERT model to obtain the semantic representation of each turn of utterance, and encoding these utterance representations through a Bi-GRU model to obtain the semantic representation of the dialog context;
Step two: capturing the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, and introducing a relational position encoding into the graph attention network to account for the sequential information among the utterances, thereby obtaining the high-level semantic representation of the graph coding layer;
Step three: taking the dialog-context semantic representation and the high-level semantic representation of the relational graph attention encoding as input, and decoding with a GRU model to generate the final dialog reply output.
As a preferred embodiment of the present invention, in step one, preprocessing the multi-turn dialog input and obtaining the semantic representation of each turn of utterance comprises: encoding the words in each input turn of utterance by first applying the BPE algorithm to obtain a sequence-tagged representation of each turn of utterance, and then inputting it into the pre-trained BERT language model for fine-tuning, thereby obtaining the semantic representation of each turn of utterance.
FIG. 1 is a schematic diagram of the framework of the relational graph attention network-based multi-round dialogue reply generation system disclosed by the invention; the overall framework comprises the following parts: a sentence input coding layer, a graph coding layer and a decoding layer. The components of the system will now be described in detail:
(1) sentence input coding layer
The sentence input coding layer comprises two different level coders: a word-level encoder and a speech-level encoder. Both encoders are described in detail below.
Word-level encoder: this encoder encodes the words in each turn of utterance input to the model, so as to obtain the semantic representation of that turn of utterance. For a given multi-turn dialog context of length M, U = {u_1, ..., u_M}, in order to fully extract the information expressed by the user's words, the model first applies the BPE algorithm to obtain the sequence-tagged representation of each turn of utterance, u_i = {x_{i,1}, x_{i,2}, ..., x_{i,T_i}}, where T_i is the number of tokens in the i-th turn of utterance, and then inputs it into the pre-trained BERT language model for fine-tuning. The word-level encoding process is expressed as c_i = BERT(x_{i,1}, x_{i,2}, ..., x_{i,T_i}), thereby obtaining the semantic representation c_i of each turn of utterance.
Utterance-level encoder: this encoder encodes the utterance representations produced by the word-level encoder, so as to obtain the semantic representation of the entire dialog context. A Bi-GRU model is used here as the utterance-level encoder. The model takes the utterance representations c_1, ..., c_M obtained from the upper word-level encoder as input and encodes each turn of utterance with the Bi-GRU:
h_i^fwd = GRU_fwd(c_i, h_{i-1}^fwd),
h_i^bwd = GRU_bwd(c_i, h_{i+1}^bwd),
where h_i^fwd is the i-th hidden state of the forward GRU and h_i^bwd is the i-th hidden state of the backward GRU. The hidden states of the forward and backward GRUs are concatenated to obtain the context-aware semantic representation h_i = [h_i^fwd; h_i^bwd]. High-level feature information between the utterances of the multi-turn dialog is captured through this hierarchical structure.
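The short sketch below illustrates the sequence-tagged representation of a single turn before it is fed to BERT for fine-tuning; the tokenizer name and the example sentence are assumptions, since the patent only states that a BPE algorithm and a pre-trained BERT model are used (the Hugging Face BERT tokenizer applies WordPiece, a closely related subword scheme).

```python
# Subword sequence tagging of one dialog turn u_i (illustrative).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
u_i = "my laptop keeps disconnecting from wifi"
tokens = tokenizer.tokenize(u_i)   # subword sequence x_{i,1}, ..., x_{i,T_i}; exact split depends on the vocabulary
T_i = len(tokens)                  # number of subword tokens in the i-th turn
print(T_i, tokens)
```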
(2) Graph coding layer
The graph coding layer first captures the self-dependencies of utterances within the multi-turn dialog and the dependencies between speakers using a graph attention network. In addition, the model introduces a new position encoding (the relational position encoding) into the graph attention network to account for the sequential information among the utterances.
Different from traditional multi-turn dialogue reply generation work, the method constructs a directed graph G = (V, E, R) over the M utterances of the multi-turn dialog and defines it as follows: each utterance in the multi-turn dialog is defined as a node v_i ∈ V; the relationship dependency information between utterances is defined as an edge (v_i, r, v_j) ∈ E, where r ∈ R is the relation type; and the weight of an edge is defined as α_ijr.
The node representation, edge type definition, relational position encoding, edge weight representation and graph propagation mechanism of the graph will be described in detail below.
In the graph attention network, the model first takes the contextual semantic representation h_i output by the context coding layer as the initial vector representation of node v_i. As shown in fig. 1, in order to fully capture the different relationship dependency information between utterances, the information edges in the graph coding layer are divided into the following four types: (a) self-before edges r1: dependency between the target utterance and earlier utterances of the same speaker; (b) inter-before edges r2: dependency between the target utterance and earlier utterances of the other speakers; (c) self-after edges r3: dependency between the target utterance and later utterances of the same speaker; (d) inter-after edges r4: dependency between the target utterance and later utterances of the other speakers. Furthermore, a method of Relational Position Encodings is proposed herein to capture the temporal information between utterances connected by these four edge types. Unlike previous absolute and relative position encodings, it encodes, under each relation type, the relative distance between utterances; fig. 2 illustrates the concept of the relational position, where the different background colors in the third row represent the different types of information edges. Position encoding is then performed for the four different relation types and the encoded information is added to the edge weights; fig. 3 shows the encoding process. In the relational position encoding, PE_ijr denotes the position encoding between the target utterance u_i and its neighboring utterance u_j under relation type r, taking values within [b, a], where b and a are the sliding-window values of the target utterance with respect to the other utterances, and N_i^r denotes the neighborhood of the target utterance u_i under relation type r. Inspired by the graph attention network model and combined with the above relational position encoding, the edge weight α_ijr between the target utterance u_i and its neighboring utterance u_j under relation type r is then computed by the attention mechanism, where W_r is the parameterized weight matrix in the attention mechanism, a_r is the parameterized weight vector, ·^T denotes transposition, and LRL is the LeakyReLU activation function.
The graph coding layer finally updates the vector representation of each node by aggregating over its neighborhoods N_i^r. The graph propagation is carried out layer by layer with trainable parameter weight matrices, where L is the number of graph convolution layers; the output of the final layer gives the high-level semantic representation of the graph coding layer.
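A sketch of one such relational graph attention layer is given below, written in PyTorch for illustration. Because the exact equations appear only as images in the original text, the precise way PE_ijr enters the attention score, the softmax normalization over the neighborhood, and the use of a ReLU after aggregation are assumptions following the standard relational graph attention formulation, not the patent's definitive implementation.

```python
# One relational graph attention layer: per-relation attention scores from a
# parameterized matrix W_r and vector a_r, a LeakyReLU activation, the relational
# position encoding added to the score, and neighborhood aggregation to update
# each node representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalGraphAttentionLayer(nn.Module):
    def __init__(self, dim, relations=("self-before", "inter-before", "self-after", "inter-after")):
        super().__init__()
        self.W = nn.ModuleDict({r: nn.Linear(dim, dim, bias=False) for r in relations})       # W_r
        self.a = nn.ParameterDict({r: nn.Parameter(torch.randn(2 * dim)) for r in relations})  # a_r

    def forward(self, h, edges):
        # h: (M, dim) node vectors; edges: list of (i, j, relation, pe) as built earlier
        scores, messages = {}, {}
        for i, j, r, pe in edges:
            pair = torch.cat([self.W[r](h[i]), self.W[r](h[j])], dim=-1)
            # attention logit for alpha_ijr; adding PE_ijr to the score is an assumed choice
            scores.setdefault(i, []).append(F.leaky_relu(self.a[r] @ pair) + pe)
            messages.setdefault(i, []).append(self.W[r](h[j]))
        h_new = h.clone()
        for i in scores:  # aggregate each node's relation-typed neighborhood
            alpha = torch.softmax(torch.stack(scores[i]), dim=0)
            h_new[i] = torch.relu((alpha.unsqueeze(-1) * torch.stack(messages[i])).sum(dim=0))
        return h_new
```

Stacking L such layers and taking the final layer's output would give the high-level semantic representation described above.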
(3) Decoding layer
The decoding layer generates the response reply based on the input contextual semantic representation and the high-level semantic representation of the graph coding layer. A GRU model is adopted here as the decoder to generate the reply. The initial decoder state s_0 is computed from the concatenation of the last hidden states of the forward and backward GRUs of the utterance-level encoder through the trainable parameters W_e and b_e; s_t is the hidden state of the decoder at time t, computed by the GRU from the previous state, the word embedding e(r_{t-1}) of the word output at time t-1, and the high-level semantic representation output by the L-th layer of the graph coding layer at time t-1.
Finally, according to the high-level semantic information of the graph coding layer, combined with the decoder hidden state s_t at time t, the output at the current time is predicted, where W_o and b_o are trainable parameters and p denotes the probability of generating a word at the current time.
The present invention takes a given real dialog reply R = [r_1, r_2, ..., r_T] as the training target and uses a cross-entropy loss function L = -Σ_{t=1}^{T} log p(r_t) to train the model parameters.
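At inference time the trained decoder can be unrolled step by step; a minimal greedy-decoding sketch is given below. The decoder_step interface, the embedding argument and the special token ids are assumptions introduced for illustration, mirroring the training sketch earlier.

```python
# Greedy reply generation with the GRU decoder described above (illustrative).
import torch

def generate_reply(decoder_step, s0, embedding, graph_repr, bos_id=1, eos_id=2, max_len=30):
    s_t = s0
    prev = embedding(torch.tensor(bos_id))       # e(r_0): start-of-sequence embedding
    reply = []
    for _ in range(max_len):
        s_t, logits = decoder_step(s_t, prev, graph_repr)
        next_id = int(torch.argmax(logits))      # greedy choice of the most probable word
        if next_id == eos_id:
            break
        reply.append(next_id)
        prev = embedding(torch.tensor(next_id))  # feed the generated word back in
    return reply
```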
the invention is verified in two open source data sets Ubuntu and Dailydialog, and the realization result is shown in the following table:
Figure BDA0003250570100000089
Figure BDA0003250570100000091
as can be seen from the table, the evaluation indexes of the method of the invention on two data sets are basically superior to those of other baseline models, and the effectiveness of the attention network method based on the relational graph provided by the invention is verified. Wherein the method of the invention is significantly higher than all baseline models in the PPL, BLEU and BERTScore indexes, which shows that the reply information generated by the method is more relevant and diversified. Meanwhile, compared with the Ours-GCN method, the model provided by the invention is higher than the traditional GCN model in indexes, and the attention mechanism introduced into the graph network layer can effectively capture information between statement relation dependencies; compared with the Ours-NPE method, the method of the invention has greatly improved performance on indexes, and proves the importance of the relation position coding.
It is to be understood that the present invention has been described with reference to certain embodiments, and that various changes in the features and embodiments, or equivalent substitutions may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (6)

1. A multi-round dialog reply generation system based on a relational graph attention network, comprising a sentence input coding layer, a graph coding layer and a decoding layer, characterized in that the sentence input coding layer comprises a word-level encoder and an utterance-level encoder; the word-level encoder encodes the words in each turn of utterance input to the model, thereby obtaining the semantic representation of that turn of utterance; the utterance-level encoder encodes these utterance representations, thereby obtaining the semantic representation of the entire dialog context; the graph coding layer first captures the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, and then introduces a relational position encoding to account for the sequential information among the utterances; the decoding layer generates a response reply based on the input contextual semantic representation and the high-level semantic representation of the graph coding layer.
2. A multi-round dialog reply generation method using the system of claim 1, comprising the following steps:
Step one: acquiring and preprocessing the multi-turn dialog input, converting the word meaning information in each turn of utterance into corresponding vector representations through a pre-trained BERT model to obtain the semantic representation of each turn of utterance, and encoding these utterance representations through a Bi-GRU model to obtain the semantic representation of the dialog context;
Step two: capturing the self-dependencies of utterances within the multi-turn dialog and the dependencies between interlocutors with a graph attention network, and introducing a relational position encoding into the graph attention network to account for the sequential information among the utterances, thereby obtaining the high-level semantic representation of the graph coding layer;
Step three: taking the dialog-context semantic representation and the high-level semantic representation of the relational graph attention encoding as input, and decoding with a GRU model to generate the final dialog reply output.
3. The multi-turn dialog reply generation method based on the relational graph attention network according to claim 2, characterized in that in step one, preprocessing the multi-turn dialog input and obtaining the semantic representation of each turn of utterance comprises: encoding the words in each input turn of utterance by first applying the BPE algorithm to obtain a sequence-tagged representation of each turn of utterance, and then inputting it into the pre-trained BERT language model for fine-tuning, thereby obtaining the semantic representation of each turn of utterance.
4. The relational graph attention network-based multi-turn dialog reply generation method according to claim 3, characterized in that step one encodes with a word-level encoder and an utterance-level encoder, specifically as follows:
for a given multi-turn dialog context of length M, U = {u_1, ..., u_M}, the word-level encoder first applies the BPE algorithm to obtain the sequence-tagged representation of each turn of utterance, u_i = {x_{i,1}, x_{i,2}, ..., x_{i,T_i}}, where T_i is the number of tokens in the i-th turn of utterance, and then inputs it into the pre-trained BERT language model for fine-tuning; the word-level encoding process is expressed as c_i = BERT(x_{i,1}, x_{i,2}, ..., x_{i,T_i}), thereby obtaining the semantic representation c_i of each turn of utterance;
a Bi-GRU model is adopted as the utterance-level encoder; the Bi-GRU takes the utterance representations c_1, ..., c_M obtained from the upper word-level encoder as input and encodes each turn of utterance, the utterance-level encoding process being expressed as h_i^fwd = GRU_fwd(c_i, h_{i-1}^fwd) and h_i^bwd = GRU_bwd(c_i, h_{i+1}^bwd), where h_i^fwd is the i-th hidden state of the forward GRU and h_i^bwd is the i-th hidden state of the backward GRU; the hidden states of the forward and backward GRUs are concatenated to obtain the context-aware semantic representation h_i = [h_i^fwd; h_i^bwd]; high-level feature information between the utterances of the multi-turn dialog is captured through this hierarchical structure.
5. The dialog reply generation method according to claim 4, characterized in that obtaining the high-level semantic representation of the graph coding layer in step two comprises the following steps:
a directed graph G = (V, E, R) is constructed over the M utterances of the multi-turn dialog and defined as follows: each utterance in the multi-turn dialog is defined as a node v_i ∈ V; the relationship dependency information between utterances is defined as an edge (v_i, r, v_j) ∈ E, where r ∈ R is the relation type; and the weight of an edge is defined as α_ijr;
(1) first, the contextual semantic representation h_i output by the context coding layer is taken as the initial vector representation of node v_i;
(2) the information edges r are constructed over the nodes, and their types are divided into the following four categories: (a) self-before edges r1: dependency between the target utterance and earlier utterances of the same speaker; (b) inter-before edges r2: dependency between the target utterance and earlier utterances of the other speakers; (c) self-after edges r3: dependency between the target utterance and later utterances of the same speaker; (d) inter-after edges r4: dependency between the target utterance and later utterances of the other speakers;
(3) relational position encoding (Relational Position Encodings) is used to capture the temporal information between utterances connected by these four edge types; PE_ijr denotes the relational position encoding between the target utterance u_i and its neighboring utterance u_j under relation type r, taking values within [b, a], where b and a are the sliding-window values of the target utterance with respect to the other utterances, and N_i^r denotes the neighborhood of the target utterance u_i under relation type r;
(4) the weights of the relational information edges are calculated: α_ijr denotes the edge weight between the target utterance u_i and its neighboring utterance u_j under relation type r, computed by the attention mechanism combined with the relational position encoding, where W_r is the parameterized weight matrix in the attention mechanism, a_r is the parameterized weight vector, ·^T denotes transposition, and LRL is the LeakyReLU activation function;
(5) the vector representation of each node is updated by aggregating over its neighborhoods N_i^r; the graph propagation is carried out layer by layer with trainable parameter weight matrices, where L is the number of graph convolution layers, and the output of the final layer gives the high-level semantic representation of the graph coding layer.
6. The dialog reply generation method based on the relational graph attention network according to claim 5, characterized in that step three comprises the following steps:
a GRU model is adopted as the decoder to generate the reply; the initial decoder state s_0 is computed from the concatenation of the last hidden states of the forward and backward GRUs of the utterance-level encoder through the trainable parameters W_e and b_e; s_t is the hidden state of the decoder at time t, computed by the GRU from the previous state, the word embedding e(r_{t-1}) of the word output at time t-1, and the high-level semantic representation output by the L-th layer of the graph coding layer at time t-1;
finally, according to the high-level semantic information of the graph coding layer, combined with the decoder hidden state s_t at time t, the output at the current time is predicted, where W_o and b_o are trainable parameters and p denotes the probability of generating a word at the current time;
with a given real dialog reply R = [r_1, r_2, ..., r_T] as the training target, a cross-entropy loss function L = -Σ_{t=1}^{T} log p(r_t) is used to train the model parameters.
CN202111044215.XA 2021-09-07 2021-09-07 Multi-round dialog reply generation system and method based on relational graph attention network Pending CN114281954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111044215.XA CN114281954A (en) 2021-09-07 2021-09-07 Multi-round dialog reply generation system and method based on relational graph attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111044215.XA CN114281954A (en) 2021-09-07 2021-09-07 Multi-round dialog reply generation system and method based on relational graph attention network

Publications (1)

Publication Number Publication Date
CN114281954A true CN114281954A (en) 2022-04-05

Family

ID=80868514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111044215.XA Pending CN114281954A (en) 2021-09-07 2021-09-07 Multi-round dialog reply generation system and method based on relational graph attention network

Country Status (1)

Country Link
CN (1) CN114281954A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115292491A (en) * 2022-08-04 2022-11-04 四川大学 CTMSN-EHI-based task-type multi-round dialogue information processing method
CN116089593A (en) * 2023-03-24 2023-05-09 齐鲁工业大学(山东省科学院) Multi-pass man-machine dialogue method and device based on time sequence feature screening coding module
KR102610897B1 * 2023-03-24 2023-12-07 Qilu University of Technology (Shandong Academy of Sciences) Method and device for multi-pass human-machine conversation based on time sequence feature screening and encoding module

Similar Documents

Publication Publication Date Title
CN108681610B (en) generating type multi-turn chatting dialogue method, system and computer readable storage medium
Oord et al. Parallel wavenet: Fast high-fidelity speech synthesis
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
CN109785824B (en) Training method and device of voice translation model
CN111159368B (en) Reply generation method of personalized dialogue
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
GB2572020A (en) A speech processing system and a method of processing a speech signal
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110297887B (en) Service robot personalized dialogue system and method based on cloud platform
CN112115687B (en) Method for generating problem by combining triplet and entity type in knowledge base
Liu et al. Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN110060657B (en) SN-based many-to-many speaker conversion method
CN111368142B (en) Video intensive event description method based on generation countermeasure network
CN115964467A (en) Visual situation fused rich semantic dialogue generation method
CN114281954A (en) Multi-round dialog reply generation system and method based on relational graph attention network
KR20230127293A (en) Information synthesis method and device, electronic device and computer readable storage medium
Yadav et al. Speech prediction in silent videos using variational autoencoders
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN111782788A (en) Automatic emotion reply generation method for open domain dialogue system
CN113360610A (en) Dialog generation method and system based on Transformer model
CN112364148A (en) Deep learning method-based generative chat robot
CN115269836A (en) Intention identification method and device
CN112100350B (en) Open domain dialogue method for intensifying reply personalized expression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination