CN116010575A - Dialogue generation method integrating basic knowledge and user information - Google Patents
- Publication number
- CN116010575A CN116010575A CN202310058399.8A CN202310058399A CN116010575A CN 116010575 A CN116010575 A CN 116010575A CN 202310058399 A CN202310058399 A CN 202310058399A CN 116010575 A CN116010575 A CN 116010575A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a dialogue generation method that integrates basic knowledge and user information. The method comprises: constructing a user information data set and a human-machine dialogue data set; acquiring a basic knowledge data set from a big-data platform; feeding the data sets to an encoder and a decoder that adopt a multi-input Transformer structure; separately encoding the historical dialogue, the user's personal information, and the basic knowledge and computing an attention vector for each; and linearly fusing all the attention vectors, so that the language model considers the three kinds of content more comprehensively and generates more reasonable replies. The method handles knowledge and role information together, improving the human-machine dialogue experience and the quality of generated replies.
Description
Technical Field
The invention relates to the technical field of artificial intelligence natural language generation, in particular to a dialogue generation method integrating basic knowledge and user information.
Background
In recent years, human-machine dialogue technology has developed rapidly and is used in many fields, such as online customer service, online medical consultation, psychological counseling, and automatic question answering; artificial intelligence is widely used in dialogue generation to provide information to people. However, most people's impression of AI dialogue still stops at the level of simple voice assistants such as "Hey Siri": the chatbot's replies often miss what the user actually said, it can only react to specific instructions, and its responses are prefabricated, so it cannot meet the needs of diverse users. In human conversation, when people provide information to others, they consider the other party's background and interests, because different people are interested in different things. Providing a large amount of knowledge without regard to the dialogue partner's own information may flood them with useless content and degrade the user experience. Against this background, a dialogue system needs to combine prior knowledge with role information when generating a reply, so as to provide information to the user more effectively. Existing data sets and dialogue generation models rarely consider both knowledge and role information, and are limited in generating dialogue that fuses the two.
Summary of the invention:
The invention aims to overcome the defects of the prior art by providing a dialogue generation method that integrates basic knowledge and user information. The method handles knowledge and role information together, and can improve the human-machine dialogue experience and the quality of language replies.
The technical solution for realizing the aim of the invention is as follows:
A dialogue generation method integrating basic knowledge and user information comprises the following steps:
1) Building a dialogue data set based on knowledge and role information: because existing dialogue data sets do not fuse knowledge and role information, a training data set that carries both knowledge and role information must be constructed, comprising: a data set D = [d_1, d_2, ..., d_n] based on the basic knowledge database DBpedia and a user characteristic information data set P = (p_1, p_2, ..., p_n); human-computer interaction is carried out with sentences carrying role-information labels to obtain dialogue data, the dialogue data comprising question sentences and reply sentences, where l_m represents a question sentence and l̂_m represents the corresponding reply sentence;
2) Obtaining a user information embedding vector and a basic knowledge embedding vector: using a multi-input Transformer structure, natural language is mapped to a vector space through word embedding. The basic knowledge sequence D and the user characteristic information P are word-embedded and encoded to obtain word embedding sequence vectors X(D) and X(P); positional encoding then converts the word embedding vectors into sine and cosine vector representations containing various frequencies, capturing the relations among words in the high-dimensional vector space and yielding the word embedding sequence vectors with position information, X_embed(D) and X_embed(P):
X_embed = Embedding + PositionalEncoding,
Embedding(D) = D · W_d, Embedding(P) = P · W_p,
PositionalEncoding_(T,2i) = sin(T / 10000^(2i/d)),
PositionalEncoding_(T,2i+1) = cos(T / 10000^(2i/d)),
wherein the symbol (·) represents word-embedding encoding; W_d and W_p represent learnable parameters; PositionalEncoding(·) represents encoding the word embedding vectors with sine and cosine functions; T represents any position in the word embedding sequence; d represents the vector dimension; PositionalEncoding_(T,2i) represents the position encoding of the word embedding sequence at position T in dimension 2i, and PositionalEncoding_(T,2i+1) that in dimension 2i+1. For the word embedding vectors X_embed(D) and X_embed(P) obtained after word embedding and positional encoding, the hidden feature sequence C_E of the source text is obtained through the subsequent calculation of the Encoder layer;
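The sinusoidal positional encoding of step 2) can be sketched as follows. This is a minimal NumPy sketch: the 10000 base and sin/cos alternation follow the standard Transformer formulation, which the patent's sine-and-cosine description matches, and the concrete sequence length and dimension are illustrative assumptions.

```python
import numpy as np

def positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Sinusoidal position encoding: sin on even dimensions 2i,
    cos on odd dimensions 2i+1, for each position T."""
    pe = np.zeros((seq_len, d))
    pos = np.arange(seq_len)[:, None]        # position T
    i = np.arange(0, d, 2)[None, :]          # dimension index 2i
    angle = pos / np.power(10000.0, i / d)
    pe[:, 0::2] = np.sin(angle)              # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)              # odd dimensions: cosine
    return pe

# X_embed = Embedding + PositionalEncoding (Embedding stood in by random data)
emb = np.random.randn(10, 16)
x_embed = emb + positional_encoding(10, 16)
```

Because the encoding depends only on the position and dimension indices, it can be precomputed once and added to any embedded sequence of compatible shape.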
3) Attention calculation: the Encoder layer of the multi-input Transformer structure encodes the user information and basic knowledge embedding vectors, and calculates the basic knowledge attention vector and the user information attention vector, specifically:
the Encoder layer consists of two sublayers: the first is a multi-head self-attention layer (Multi-Head Attention), and the other is a feed-forward neural network (FFN) layer,
calculating the multi-head attention vector Attention(·) according to the following formulas:
Q = Linear(X_embed) = X_embed · W_q,
K = Linear(X_embed) = X_embed · W_k,
V = Linear(X_embed) = X_embed · W_v,
Attention(Q, K, V) = Softmax(Q·K^T / √d)·V,
wherein Linear(·) represents a linear transformation operation; Q, K, and V represent the query vector sequence, key vector sequence, and value vector sequence respectively; W_q, W_k, and W_v are different learnable parameter matrices, with W_q, W_k, W_v ∈ R^d, where R represents the real numbers and d is the dimension of the word embedding vector X_embed; Softmax represents the normalized exponential function, and K^T is the transpose of K. Multiple heads mean multiple different parameter matrices W_i used to learn multiple meaning representations: X_embed is linearly projected onto each feature space i by multiplying it with the three weights W_q, W_k, and W_v.
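The single-head attention pooling above can be sketched in NumPy. This is a hedged sketch of the standard scaled dot-product form, which matches the quantities named in the text (Softmax, K^T, the dimension d); the matrix shapes and random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # normalized exponential function, stabilized by subtracting the row max
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x_embed, W_q, W_k, W_v):
    """Q, K, V are linear maps of X_embed; the output is
    Softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = x_embed @ W_q, x_embed @ W_k, x_embed @ W_v
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))   # one attention row per query token
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                   # 5 tokens, embedding dimension 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = attention(x, W_q, W_k, W_v)             # shape (5, 8)
```

Each row of the softmaxed weight matrix sums to 1, so every output position is a convex combination of the value vectors.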
The multi-head attention layer merges knowledge from the same attention pooling carried out in h different feature spaces over the same query matrix Q, key matrix K, and value matrix V, specifically expressed as h sets of different parameter matrices W_i^Q, W_i^K, W_i^V, i ∈ (1, h). The h sets of transformed queries, keys, and values are attention-pooled in parallel; the h attention-pooled outputs head_i are spliced together and transformed by another learnable linear projection matrix W_h to produce the final basic knowledge multi-head attention output X_att(D):
X_att(D) = MultiHead(Q, K, V) = Concat(head_1; head_2; ...; head_h)·W_h,
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V),
wherein MultiHead represents the multi-head attention function; Concat represents the concatenation operation; head_i represents the attention output of the i-th subspace, i ∈ (1, h); W_i^Q represents the parameter matrix of the query vector Q in the i-th feature space, W_i^K that of the key vector K, and W_i^V that of the value vector V, i ∈ (1, h); and W_h represents a learnable linear projection matrix. The calculation of the user information attention vector X_att(P) is the same as that of the basic knowledge attention vector X_att(D) described above;
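The parallel pooling, splicing (Concat), and W_h projection of the multi-head layer can be sketched as below. A minimal sketch under assumed shapes: h heads, per-head dimension, and the random weights are all illustrative, not values from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product pooling for one head
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(x, heads, W_h):
    """h parallel attention poolings over projected Q, K, V;
    the h outputs head_i are concatenated and mixed by W_h."""
    outs = [attention(x @ Wq, x @ Wk, x @ Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=-1) @ W_h

rng = np.random.default_rng(1)
h, d, d_head = 4, 8, 2
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(h)]
W_h = rng.normal(size=(h * d_head, d))        # learnable projection after Concat
x_att = multi_head(rng.normal(size=(6, d)), heads, W_h)   # shape (6, 8)
```

The same routine computes X_att(D) or X_att(P) depending only on which embedded sequence is passed in, matching the statement that the two attention vectors are calculated identically.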
4) Acquiring the source text hidden representation: the knowledge attention vector and the role information attention vector are fused, and the hidden representation C_E of the source text is obtained through a feed-forward neural network, specifically:
the basic knowledge attention feature vector X_att(D) calculated in step 3) and the user information attention feature vector X_att(P) are linearly fused, and the encoder outputs the fused attention vector X_hidden:
X_hidden = Linear{X_att(D); X_att(P)},
where Linear{ ; } represents the concatenation operation. The output X_hidden passes through the feed-forward neural network FFN and is then joined by a residual link, i.e., X_hidden is added position-wise to the output of the FFN, and the sum finally undergoes one LayerNorm layer normalization:
FFN = Linear{ReLU[Linear(X_hidden)]},
C_E = LayerNorm(X_hidden + FFN),
wherein LayerNorm represents the layer normalization operation and FFN represents a two-layer fully connected network with ReLU as the activation function. The encoder's source text hidden representation C_E serves as the input to the next module, the decoder;
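Step 4's fuse-then-normalize pipeline can be sketched as follows: concatenate the two attention vectors and fuse them linearly, apply a two-layer ReLU FFN, add the residual, and layer-normalize. All matrix shapes and the random inputs are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def encoder_output(x_att_d, x_att_p, W_f, W_1, W_2):
    """X_hidden = Linear{X_att(D); X_att(P)};
    C_E = LayerNorm(X_hidden + FFN(X_hidden)) with FFN = Linear{ReLU[Linear(.)]}."""
    x_hidden = np.concatenate([x_att_d, x_att_p], axis=-1) @ W_f   # linear fusion
    ffn = np.maximum(0.0, x_hidden @ W_1) @ W_2                    # two-layer ReLU FFN
    return layer_norm(x_hidden + ffn)                              # residual + LayerNorm

rng = np.random.default_rng(2)
d = 8
x_d, x_p = rng.normal(size=(5, d)), rng.normal(size=(5, d))
C_E = encoder_output(x_d, x_p,
                     rng.normal(size=(2 * d, d)),    # fusion weights (assumed shape)
                     rng.normal(size=(d, 4 * d)),    # FFN expansion
                     rng.normal(size=(4 * d, d)))    # FFN contraction
```

After layer normalization every position of C_E has approximately zero mean across the feature dimension, which stabilizes the decoder's consumption of the source representation.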
5) Optimizing the language generation model: the Decoder of the multi-input Transformer structure simultaneously calculates attention vectors for the basic knowledge, the user information, and the current state; the three results are linearly fused to obtain a fused attention feature representation, which is combined with the source text hidden representation C_E to obtain the context hidden representation C_D; a reply text sequence Y is generated from C_D, and a loss function is defined to optimize the language generation model, specifically:
the Decoder of the multi-input Transformer structure consists of three sublayers: the first is a masked multi-head self-attention layer, the second is an encoder-decoder attention layer, and the third is a feed-forward neural network layer. The masked multi-head attention layer takes the historical dialogue data as input and extracts the masked attention vector:
X_att^mask(L) = MaskedMultiHead(X_embed(L)),
wherein X_att^mask(L) represents the masked attention vector and X_embed(L) represents the dialogue data vector after word embedding and positional encoding. The encoder-decoder multi-head attention layer then receives the encoder output C_E as input and extracts the historical dialogue attention feature vector E; E, the basic knowledge attention feature vector X_att(D), and the user information attention feature vector X_att(P) are linearly fused to output the decoder hidden representation C_D:
X_att^fuse = Linear{E; X_att(D); X_att(P)}·W_f,
C_D = LayerNorm(X_att^fuse + FFN(X_att^fuse)),
wherein E represents the historical dialogue attention feature vector; W_f represents a learnable projection matrix; X_att(D) represents the basic knowledge attention feature vector; X_att(P) represents the user information attention feature vector; X_att^fuse represents the fused attention vector that fuses the basic knowledge, user information, and historical dialogue information; Linear{ ; } represents the concatenation operation; and C_D represents the decoder hidden representation obtained after the layer normalization operation. The feed-forward neural network layer is the same as in the encoder, i.e., a two-layer fully connected network with a residual link followed by a normalization layer. The decoder output C_D, i.e., the output of the last decoder basic unit, is mapped through a linear transformation and the Softmax function to the probability distribution of the predicted word at the next time instant: given the encoder output C_E and the decoder output y_{t-1} at the previous time step, the probability distribution P(Y) of the word at the current moment is predicted, and the functional expression of the probability representation P(Y) of the generated reply sequence text Y is:
Y = (y_1, y_2, ..., y_{t-1}, y_t),
P(Y) = Softmax(FFN(C_D) + C_D),
wherein Y represents the reply text generated by the model, Softmax represents the normalized exponential function, and FFN represents a two-layer fully connected network with ReLU as the activation function. Maximum likelihood estimation is adopted for the loss function, which aims to minimize the negative log likelihood of language generation modeling, user judgment, and knowledge judgment; the overall loss function combines the maximum likelihood estimation terms with weight parameters:
L_P = -Σ_{i=1}^{m} log P(p̂_i = p_i),
L_D = -Σ_{i=1}^{n} log P(d̂_i = d_i),
Loss = α·L_P + β·L_D,
wherein m and n represent the numbers of samples; L_P represents the user-judgment loss function and L_D the knowledge-judgment loss function; p_i represents the i-th user sample used by the model in prediction; d_i represents the i-th piece of knowledge used by the model in prediction; p̂_i and d̂_i represent the model's predictions for user judgment and knowledge judgment; Loss represents the joint loss function; and α and β represent adjustable weight parameters.
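The weighted joint loss Loss = α·L_P + β·L_D can be sketched as below, assuming (as a simplification the patent does not spell out) that L_P and L_D are the negative log likelihoods of the probabilities the model assigns to the gold user samples and gold knowledge entries; the function name, the equal default weights, and the example probabilities are illustrative.

```python
import numpy as np

def joint_loss(p_probs, d_probs, alpha=0.5, beta=0.5):
    """Loss = alpha*L_P + beta*L_D, where L_P / L_D are negative log
    likelihoods over the m user samples and n knowledge entries."""
    L_P = -np.sum(np.log(p_probs))   # user-judgment NLL over m samples
    L_D = -np.sum(np.log(d_probs))   # knowledge-judgment NLL over n samples
    return alpha * L_P + beta * L_D

# probabilities the model assigned to the correct user / knowledge items
loss = joint_loss(np.array([0.9, 0.8]), np.array([0.7, 0.6, 0.5]))
```

Raising α relative to β pushes training toward matching the user profile; raising β emphasizes selecting the right knowledge, which is the role of the "adjustable weight parameters" in the text.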
The key processing procedure in the encoder of the technical solution comprises: word embedding and positional encoding are performed in turn on the basic knowledge and on the historical dialogues carrying the user's personalized labels, obtaining the word embedding feature vectors with position information, X_embed(D) and X_embed(P); X_embed(D) and X_embed(P) are input to the encoder, which calculates multi-head self-attention vectors to obtain the corresponding representations; the source text hidden feature representation C_E is then obtained through the feed-forward fully connected layer, layer normalization, and residual connection, and C_E serves as the input to the decoder.
The key processing procedure in the decoder of the technical solution comprises: the semantic vector sequence obtained by word embedding and positional encoding of the dialogue data is input to the decoder; using a multi-head attention mechanism, the attention vector E of the historical dialogue is obtained through multi-head attention calculation with the hidden state vector C_E output by the encoder; attention vectors are calculated three times, for the user's personalized information, the basic knowledge, and the historical dialogue state respectively; the three results are linearly fused, and the decoder hidden state representation C_D is extracted from the fused attention vector; C_D passes through the feed-forward neural network layer and the Softmax operation to obtain the generated sentence Y; the dialogue generation model is optimized with the joint loss function, and the trained model is obtained when the joint loss reaches its minimum during iteration. The user inputs question content into the trained dialogue generation model and finally obtains a reply sentence that fuses user information and basic knowledge.
The technical solution improves the conventional Transformer encoding and decoding layers: a multi-input attention structure is added to the original single-input structure to generate basic knowledge responses and user information responses, and a basic knowledge base and user information portraits are adopted to generate customized replies, which helps to grasp the user's core requirements accurately.
According to the technical solution, the language generation model is trained with an Encoder-Decoder structure based on the multi-input Transformer, so that the model can reply with respect to a user portrait combined with basic knowledge, improving the quality of language replies.
The method handles knowledge and role information together, and can improve the human-machine dialogue experience and the quality of language replies.
Description of the drawings:
FIG. 1 is a schematic flow chart of a method of an embodiment;
FIG. 2 is a schematic diagram of an Encoder Encoder in an embodiment;
fig. 3 is a schematic diagram of a Decoder in an embodiment.
The specific embodiment is as follows:
the present invention will now be further illustrated, but not limited, by the following figures and examples.
Examples:
Referring to fig. 1, a dialogue generation method fusing basic knowledge and user information comprises the following steps:
1) Building a dialogue data set based on knowledge and role information: because the existing dialogue data sets do not integrate knowledge and role information, a training data set carrying both knowledge and role information must be constructed, comprising: a data set D = [d_1, d_2, ..., d_n] based on the basic knowledge database DBpedia and a user characteristic information data set P = (p_1, p_2, ..., p_n); human-computer interaction is carried out with sentences carrying role-information labels to obtain dialogue data, the dialogue data comprising question sentences and reply sentences, where l_m represents a question sentence and l̂_m represents the corresponding reply sentence;
2) Obtaining a user information embedding vector and a basic knowledge embedding vector: as shown in fig. 2, a multi-input Transformer structure is adopted to map natural language to a vector space through word embedding. The basic knowledge sequence D and the user characteristic information P are word-embedded and encoded to obtain word embedding sequence vectors X(D) and X(P); positional encoding then converts the word embedding vectors into sine and cosine vector representations containing various frequencies, capturing the relations among words in the high-dimensional vector space and yielding the word embedding sequence vectors with position information, X_embed(D) and X_embed(P):
X_embed = Embedding + PositionalEncoding,
Embedding(D) = D · W_d, Embedding(P) = P · W_p,
PositionalEncoding_(T,2i) = sin(T / 10000^(2i/d)),
PositionalEncoding_(T,2i+1) = cos(T / 10000^(2i/d)),
wherein the symbol (·) represents word-embedding encoding; W_d and W_p represent learnable parameters; PositionalEncoding(·) represents encoding the word embedding vectors with sine and cosine functions; T represents any position in the word embedding sequence; d represents the vector dimension; PositionalEncoding_(T,2i) represents the position encoding of the word embedding sequence at position T in dimension 2i, and PositionalEncoding_(T,2i+1) that in dimension 2i+1. For the word embedding vectors X_embed(D) and X_embed(P) obtained after word embedding and positional encoding, the hidden feature sequence C_E of the source text is obtained through the subsequent calculation of the Encoder layer;
3) Attention calculation: the Encoder layer of the multi-input Transformer structure encodes the user information and basic knowledge embedding vectors, and calculates the basic knowledge attention vector and the user information attention vector, specifically:
the Encoder layer consists of two sublayers: the first is a multi-head self-attention layer (Multi-Head Attention), and the other is a feed-forward neural network (FFN) layer,
calculating the multi-head attention vector Attention(·) according to the following formulas:
Q = Linear(X_embed) = X_embed · W_q,
K = Linear(X_embed) = X_embed · W_k,
V = Linear(X_embed) = X_embed · W_v,
Attention(Q, K, V) = Softmax(Q·K^T / √d)·V,
wherein Linear(·) represents a linear transformation operation; Q, K, and V represent the query vector sequence, key vector sequence, and value vector sequence respectively; W_q, W_k, and W_v are different learnable parameter matrices, with W_q, W_k, W_v ∈ R^d, where R represents the real numbers and d is the dimension of the word embedding vector X_embed; Softmax represents the normalized exponential function, and K^T is the transpose of K. Multiple heads mean multiple different parameter matrices W_i used to learn multiple meaning representations: X_embed is linearly projected onto each feature space i by multiplying it with the three weights W_q, W_k, and W_v.
The multi-head attention layer merges knowledge from the same attention pooling carried out in h different feature spaces over the same query matrix Q, key matrix K, and value matrix V, specifically expressed as h sets of different parameter matrices W_i^Q, W_i^K, W_i^V, i ∈ (1, h). The h sets of transformed queries, keys, and values are attention-pooled in parallel; the h attention-pooled outputs head_i are spliced together and transformed by another learnable linear projection matrix W_h to produce the final basic knowledge multi-head attention output X_att(D):
X_att(D) = MultiHead(Q, K, V) = Concat(head_1; head_2; ...; head_h)·W_h,
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V),
wherein MultiHead represents the multi-head attention function; Concat represents the concatenation operation; head_i represents the attention output of the i-th subspace, i ∈ (1, h); W_i^Q represents the parameter matrix of the query vector Q in the i-th feature space, W_i^K that of the key vector K, and W_i^V that of the value vector V, i ∈ (1, h); and W_h represents a learnable linear projection matrix. The calculation of the user information attention vector X_att(P) is the same as that of the basic knowledge attention vector X_att(D) described above;
4) Acquiring the source text hidden representation: the knowledge attention vector and the role information attention vector are fused, and the hidden representation C_E of the source text is obtained through a feed-forward neural network, specifically:
the basic knowledge attention feature vector X_att(D) calculated in step 3) and the user information attention feature vector X_att(P) are linearly fused, and the encoder outputs the fused attention vector X_hidden:
X_hidden = Linear{X_att(D); X_att(P)},
where Linear{ ; } represents the concatenation operation. The output X_hidden passes through the feed-forward neural network FFN and is then joined by a residual link, i.e., X_hidden is added position-wise to the output of the FFN, and the sum finally undergoes one LayerNorm layer normalization:
FFN = Linear{ReLU[Linear(X_hidden)]},
C_E = LayerNorm(X_hidden + FFN),
wherein LayerNorm represents the layer normalization operation and FFN represents a two-layer fully connected network with ReLU as the activation function. The encoder's source text hidden representation C_E serves as the input to the next module, the decoder;
5) Optimizing the language generation model: the Decoder of the multi-input Transformer structure simultaneously calculates attention vectors for the basic knowledge, the user information, and the current state; the three results are linearly fused to obtain a fused attention feature representation, which is combined with the source text hidden representation C_E to obtain the context hidden representation C_D; a reply text sequence Y is generated from C_D, and a loss function is defined to optimize the language generation model, as shown in fig. 3, specifically:
the Decoder of the multi-input Transformer structure consists of three sublayers: the first is a masked multi-head self-attention layer, the second is an encoder-decoder attention layer, and the third is a feed-forward neural network layer. The masked multi-head attention layer takes the historical dialogue data as input and extracts the masked attention vector:
X_att^mask(L) = MaskedMultiHead(X_embed(L)),
wherein X_att^mask(L) represents the masked attention vector and X_embed(L) represents the dialogue data vector after word embedding and positional encoding. The encoder-decoder multi-head attention layer then receives the encoder output C_E as input and extracts the historical dialogue attention feature vector E; E, the basic knowledge attention feature vector X_att(D), and the user information attention feature vector X_att(P) are linearly fused to output the decoder hidden representation C_D:
X_att^fuse = Linear{E; X_att(D); X_att(P)}·W_f,
C_D = LayerNorm(X_att^fuse + FFN(X_att^fuse)),
wherein E represents the historical dialogue attention feature vector; W_f represents a learnable projection matrix; X_att(D) represents the basic knowledge attention feature vector; X_att(P) represents the user information attention feature vector; X_att^fuse represents the fused attention vector that fuses the basic knowledge, user information, and historical dialogue information; Linear{ ; } represents the concatenation operation; and C_D represents the decoder hidden representation obtained after the layer normalization operation. The feed-forward neural network layer is the same as in the encoder, i.e., a two-layer fully connected network with a residual link followed by a normalization layer. The decoder output C_D, i.e., the output of the last decoder basic unit, is mapped through a linear transformation and the Softmax function to the probability distribution of the predicted word at the next time instant: given the encoder output C_E and the decoder output y_{t-1} at the previous time step, the probability distribution P(Y) of the word at the current moment is predicted, and the functional expression of the probability representation P(Y) of the generated reply sequence text Y is:
Y = (y_1, y_2, ..., y_{t-1}, y_t),
P(Y) = Softmax(FFN(C_D) + C_D),
wherein Y represents the reply text generated by the model, Softmax represents the normalized exponential function, and FFN represents a two-layer fully connected network with ReLU as the activation function. Maximum likelihood estimation is adopted for the loss function, which aims to minimize the negative log likelihood of language generation modeling, user judgment, and knowledge judgment; the overall loss function combines the maximum likelihood estimation terms with weight parameters:
L_P = -Σ_{i=1}^{m} log P(p̂_i = p_i),
L_D = -Σ_{i=1}^{n} log P(d̂_i = d_i),
Loss = α·L_P + β·L_D,
wherein m and n represent the numbers of samples; L_P represents the user-judgment loss function and L_D the knowledge-judgment loss function; p_i represents the i-th user sample used by the model in prediction; d_i represents the i-th piece of knowledge used by the model in prediction; p̂_i and d̂_i represent the model's predictions for user judgment and knowledge judgment; Loss represents the joint loss function; and α and β represent adjustable weight parameters.
Claims (1)
1. A dialogue generation method integrating basic knowledge and user information is characterized by comprising the following steps:
1) Building a dialogue data set based on knowledge and role information: constructing a training data set carrying both knowledge and role information, comprising: a data set D = [d_1, d_2, ..., d_n] based on the basic knowledge database DBpedia and a user characteristic information data set P = (p_1, p_2, ..., p_n); human-computer interaction is carried out with sentences carrying role-information labels to obtain dialogue data, the dialogue data comprising question sentences and reply sentences, where l_m represents a question sentence and l̂_m represents the corresponding reply sentence;
2) Obtaining a user information embedding vector and a basic knowledge embedding vector: using a multi-input Transformer structure, natural language is mapped to a vector space through word embedding. The basic knowledge sequence D and the user characteristic information P are word-embedded and encoded to obtain word embedding sequence vectors X(D) and X(P); positional encoding then converts the word embedding vectors into sine and cosine vector representations containing various frequencies, capturing the relations among words in the high-dimensional vector space and yielding the word embedding sequence vectors with position information, X_embed(D) and X_embed(P):
X_embed = Embedding + PositionalEncoding,
Embedding(D) = D · W_d, Embedding(P) = P · W_p,
PositionalEncoding_(T,2i) = sin(T / 10000^(2i/d)),
PositionalEncoding_(T,2i+1) = cos(T / 10000^(2i/d)),
wherein the symbol (·) represents word-embedding encoding; W_d and W_p represent learnable parameters; PositionalEncoding(·) represents encoding the word embedding vectors with sine and cosine functions; T represents any position in the word embedding sequence; d represents the vector dimension; PositionalEncoding_(T,2i) represents the position encoding of the word embedding sequence at position T in dimension 2i, and PositionalEncoding_(T,2i+1) that in dimension 2i+1. For the word embedding vectors X_embed(D) and X_embed(P) obtained after word embedding and positional encoding, the hidden feature sequence C_E of the source text is obtained through the calculation of the Encoder layer;
3) Attention calculation: the Encoder layer of the multi-input Transformer structure encodes the user information and basic knowledge embedding vectors, and calculates the basic knowledge attention vector and the user information attention vector, specifically:
the Encoder layer consists of two sublayers: the first is a multi-head self-attention layer (Multi-Head Attention), and the other is a feed-forward neural network (FFN) layer,
The multi-head Attention vector Attention(·) is calculated according to the following formulas:
Q = Linear(X_embed) = X_embed · W_q,
K = Linear(X_embed) = X_embed · W_k,
V = Linear(X_embed) = X_embed · W_v,
Attention(Q, K, V) = Softmax(Q · K^T / √d) · V,
wherein Linear(·) represents a linear transformation operation; Q, K, V represent the query, key, and value vector sequences respectively; W_q, W_k, W_v are distinct learnable parameter matrices with W_q, W_k, W_v ∈ R^(d×d), R representing the real numbers and d the dimension of the word embedding vector X_embed; Softmax represents the normalized exponential function; and K^T is the transpose of K. "Multi-head" means that multiple different parameter matrices W_i linearly project X_embed into each feature space i, multiplying respectively by the three weights W_q, W_k, W_v. The multi-head attention layer merges knowledge from the same attention pooling applied in h different feature spaces of the same query matrix Q, key matrix K and value matrix V, expressed as h groups of different parameter matrices: the h groups of transformed queries, keys and values are attention-pooled in parallel, the h attention-pooled outputs head_i are spliced together, and a learnable linear projection matrix W_h transforms them to generate the final basic knowledge multi-head attention output X_att(D):
head_i = Attention(Q · W_i^Q, K · W_i^K, V · W_i^V),
X_att(D) = MultiHead(Q, K, V) = Concat(head_1; head_2; …; head_h) · W_h,
wherein MultiHead represents the multi-head attention function, Concat represents the concatenation operation, and head_i represents the attention output of the i-th subspace, i ∈ (1, h); W_i^Q represents the parameter matrix of the query vector Q in the i-th feature space, W_i^K the parameter matrix of the key vector K in the i-th feature space, and W_i^V the parameter matrix of the value vector V in the i-th feature space, i ∈ (1, h); W_h represents a learnable linear projection matrix. The user information attention vector X_att(P) is calculated in the same way as the basic knowledge attention vector X_att(D);
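The scaled dot-product and multi-head computation of step 3) can be sketched as follows (for brevity the heads are formed by slicing Q, K, V into column blocks, which plays the role of the per-head projections W_i^Q, W_i^K, W_i^V; all names are ours):

```python
import numpy as np

def softmax(x):
    """Row-wise normalized exponential function."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(X, Wq, Wk, Wv, Wh, h):
    """Project X to Q, K, V; attend in h parallel subspaces; concat; project by Wh."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_head = Q.shape[-1] // h
    heads = [attention(Q[:, i*d_head:(i+1)*d_head],   # head_i over subspace i
                       K[:, i*d_head:(i+1)*d_head],
                       V[:, i*d_head:(i+1)*d_head]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wh        # Concat(head_1; ...; head_h) W_h
```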
4) Acquiring the source text hidden representation: the knowledge attention vector and the character information attention vector are fused, and the hidden representation C_E of the source text is obtained through a forward neural network, specifically:
the basic knowledge attention feature vector X_att(D) calculated in step 3) is linearly fused with the user information attention feature vector X_att(P) to output the encoder fused attention vector X_hidden:
X_hidden = Linear{X_att(D); X_att(P)},
wherein Linear{;} represents the concatenation operation. The output X_hidden passes through the feed-forward neural network FFN and is then connected with a residual link, i.e., X_hidden is added element-wise to the output of the FFN, and the sum finally undergoes one LayerNorm layer normalization, calculated as follows:
FFN = Linear{ReLU[Linear(X_hidden)]},
C_E = LayerNorm(X_hidden + FFN),
wherein LayerNorm represents the layer normalization operation and FFN represents a two-layer fully-connected network with ReLU as the activation function. The encoder's source text hidden representation C_E serves as input to the next module, the decoder;
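Step 4)'s fusion, FFN, residual link and layer normalization can be sketched like this (weight shapes and names are assumptions; LayerNorm is shown without its learnable gain and bias):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain/bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(X_att_D, X_att_P, W_fuse, W1, W2):
    """X_hidden = Linear{X_att(D); X_att(P)}; C_E = LayerNorm(X_hidden + FFN(X_hidden))."""
    X_hidden = np.concatenate([X_att_D, X_att_P], axis=-1) @ W_fuse  # fuse the two attentions
    ffn = np.maximum(X_hidden @ W1, 0.0) @ W2                        # Linear -> ReLU -> Linear
    return layer_norm(X_hidden + ffn)                                # residual link + LayerNorm
```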
5) Optimizing the language generation model: the Decoder of the multi-input Transformer structure simultaneously calculates attention vectors for the basic knowledge, the user information, and the current state; the three results are linearly fused to obtain a fused attention feature representation, which is combined with the source text hidden representation C_E to acquire the context hidden representation C_D; a reply text sequence Y is generated from the context hidden representation C_D, and a loss function is defined to optimize the language generation model, as shown in fig. 3, specifically:
the Decoder of the multi-input Transformer structure consists of three sublayers: the first is a masked multi-head self-attention layer, the second is an encoder-decoder attention layer, and the third is a feed-forward neural network layer. The masked multi-head attention layer takes historical dialogue data as input and extracts the masked attention vector, calculated as follows:
X_att^mask(L) = MaskedMultiHead(X_embed(L)),
wherein X_att^mask(L) represents the masked attention vector and X_embed(L) represents the word-embedded and position-encoded dialogue data vector. The encoder output C_E is received as input to the encoder-decoder multi-head attention layer, which extracts the historical dialogue attention feature vector E; E, the basic knowledge attention feature vector X_att(D) and the user information attention feature vector X_att(P) are linearly fused to output the decoder hidden representation C_D, calculated as follows:
E = MultiHead(X_att^mask(L), C_E, C_E),
X_fuse = Linear{E; X_att(D); X_att(P)} · W_f,
C_D = LayerNorm(X_fuse + FFN(X_fuse)),
wherein E represents the historical dialogue attention feature vector, W_f represents a learnable projection matrix, X_att(D) represents the basic knowledge attention feature vector, X_att(P) represents the user information attention feature vector, X_fuse represents the fused attention vector combining basic knowledge, user information and historical session information, Linear{;} represents the concatenation operation, and C_D represents the decoder hidden representation obtained after the layer normalization operation. The feed-forward neural network layer part is the same as in the encoder, i.e., a two-layer fully-connected network with a residual link followed by a normalization layer. The decoder output C_D, i.e., the output of the last decoder base unit, is mapped via a linear transformation and the Softmax function to the probability distribution of the predicted word at the next time instant: given the encoder output C_E and the decoder output y_(t-1) at the previous time, the probability distribution P(Y) of the word at the current moment is predicted, and the functional expression of the probability representation P(Y) of the reply sequence text Y is:
Y = (y_1, y_2, …, y_(t-1), y_t),
P(Y) = Softmax(FFN(C_D) + C_D),
wherein Y represents the reply text generated by the model, Softmax represents the normalized exponential function, and FFN represents a two-layer fully-connected network with ReLU as the activation function. Maximum likelihood estimation is adopted as the loss function, and the overall total loss function combines the maximum likelihood estimation functions with weight parameters:
Loss = α · L_P + β · L_D,
wherein m and n represent the numbers of samples; L_P represents the user-judgment loss function and L_D the knowledge-judgment loss function; p_i represents the i-th user sample used by the model in prediction; d_i represents the i-th piece of knowledge used by the model in prediction; the corresponding predicted values denote the model's prediction results on user judgment and knowledge judgment; Loss represents the joint loss function; and α and β represent adjustable weight parameters.
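A sketch of the weighted joint loss Loss = α · L_P + β · L_D. The exact forms of L_P and L_D are not reproduced in the text above, so negative log-likelihood (maximum likelihood) is an assumed instantiation of both terms:

```python
import numpy as np

def nll(probs, targets):
    """Negative log-likelihood of the target indices under predicted distributions."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-12))

def joint_loss(user_probs, p_targets, knowledge_probs, d_targets, alpha=0.5, beta=0.5):
    """Loss = alpha * L_P + beta * L_D: weighted sum of the user-judgment (L_P)
    and knowledge-judgment (L_D) maximum-likelihood losses."""
    return alpha * nll(user_probs, p_targets) + beta * nll(knowledge_probs, d_targets)
```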
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310058399.8A CN116010575A (en) | 2023-01-19 | 2023-01-19 | Dialogue generation method integrating basic knowledge and user information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116010575A true CN116010575A (en) | 2023-04-25 |
Family
ID=86035719
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310058399.8A Pending CN116010575A (en) | 2023-01-19 | 2023-01-19 | Dialogue generation method integrating basic knowledge and user information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116010575A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116244419A (en) * | 2023-05-12 | 2023-06-09 | 苏州大学 | Knowledge enhancement dialogue generation method and system based on character attribute |
CN116502069A (en) * | 2023-06-25 | 2023-07-28 | 四川大学 | Haptic time sequence signal identification method based on deep learning |
CN116759077A (en) * | 2023-08-18 | 2023-09-15 | 北方健康医疗大数据科技有限公司 | Medical dialogue intention recognition method based on intelligent agent |
CN116821168A (en) * | 2023-08-24 | 2023-09-29 | 吉奥时空信息技术股份有限公司 | Improved NL2SQL method based on large language model |
CN116975654A (en) * | 2023-08-22 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Object interaction method, device, electronic equipment, storage medium and program product |
CN117746078A (en) * | 2024-02-21 | 2024-03-22 | 杭州觅睿科技股份有限公司 | Object detection method and system based on user-defined category |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||