Disclosure of Invention
The invention provides a multi-turn dialogue device and a multi-turn dialogue method for solving the technical problems in the prior art.
In order to achieve the above object, the present invention provides a multi-turn dialogue device, which comprises a data processing module, a representation module, a feature extraction module, a question-answer feature similarity module, and an objective function module, wherein:
the data processing module is used for parsing the multi-turn conversation data of historical chats to obtain the input data: the context dialogue text data (i.e., the preceding dialogue), the question data, and the answer data;
the representation module is used for mapping the input data to obtain a sentence vector set;
the feature extraction module is used for analyzing the sentence vector set to obtain a context feature vector, a question feature vector and an answer feature vector;
the question-answer feature similarity module is used for processing the context feature vectors, the question feature vectors and the answer feature vectors to obtain a scoring matrix;
and the objective function module is used for setting an objective function suitable for the multi-turn dialogue device according to the scoring matrix.
Further, the mapping of the input data by the representation module comprises:
dividing each sentence of the context dialogue text data, the question data and the answer data into words;
representing the position of each word by an ID;
representing each ID by an N-dimensional random vector;
and obtaining a sentence vector set.
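As a non-limiting illustration, the mapping performed by the representation module can be sketched as follows. The function and variable names, the whitespace word segmentation, and the tiny dimensionality are illustrative assumptions, not part of the invention:

```python
import random

def build_sentence_vectors(sentences, n_dims=4, seed=0):
    """Map segmented sentences to a set of sentence vectors.

    Each distinct word is assigned an integer ID (one simple reading of
    representing word positions by IDs), and each ID is represented by a
    random N-dimensional vector. The embodiment mentions e.g. N = 512; a
    tiny N is used here purely for illustration.
    """
    rng = random.Random(seed)
    vocab = {}        # word -> ID
    id_vectors = {}   # ID -> random N-dimensional vector
    sentence_vectors = []
    for sentence in sentences:
        vectors = []
        for word in sentence.split():   # word segmentation (whitespace here)
            if word not in vocab:
                vocab[word] = len(vocab)
                id_vectors[vocab[word]] = [rng.uniform(-1.0, 1.0)
                                           for _ in range(n_dims)]
            vectors.append(id_vectors[vocab[word]])
        sentence_vectors.append(vectors)
    return sentence_vectors, vocab

context = ["how do I reset my password", "click the settings menu"]
vecs, vocab = build_sentence_vectors(context, n_dims=4)
```

The resulting sentence vector set can then be fed to the feature extraction module.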
Further, the processing by the question-answer feature similarity module to obtain the scoring matrix comprises:
splicing and summing the context feature vectors and the question feature vectors;
and performing matrix multiplication on the answer feature vectors and the features obtained after splicing, to obtain the scoring matrix.
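A minimal sketch of this scoring step, under the assumption that the context and question features are pooled vectors of equal dimension and that "splicing and summing" is realized as element-wise summation:

```python
import numpy as np

def scoring_matrix(context_feat, question_feat, answer_feat):
    """Fuse context and question features, then score candidate answers.

    Hypothetical shapes: context_feat and question_feat are pooled (d,)
    feature vectors; answer_feat is (num_candidates, d). The fused
    vector is matrix-multiplied with the answer features to yield one
    score per candidate answer.
    """
    fused = context_feat + question_feat   # (d,)
    return answer_feat @ fused             # (num_candidates,)

ctx = np.array([1.0, 0.0, 2.0])            # context feature vector
q = np.array([0.0, 1.0, 1.0])              # question feature vector
answers = np.array([[1.0, 1.0, 0.0],       # candidate answer features
                    [0.0, 0.0, 1.0]])
scores = scoring_matrix(ctx, q, answers)   # higher score = closer answer
```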
Further, the objective function module obtains the objective function by using softmax as the activation function and cross entropy as the loss function, and deriving accordingly.
Furthermore, the feature extraction module is formed by stacking a plurality of dual-encoder modules, wherein each dual-encoder module consists of a self-attention layer, a normalization layer, a feedforward neural network layer and a normalization layer connected in sequence.
Further, each normalization layer performs normalization processing on the sum of a sub-layer's output vector and its input vector, i.e., after a residual connection.
The invention also discloses a multi-turn dialogue method, which is applied to a multi-turn dialogue device and comprises the following steps:
converting the current input sound of the user into a natural language text;
inputting the historical dialogue state and the current natural language text into the multi-turn dialogue device;
predicting the current conversation state by the multi-turn conversation device according to the historical conversation state and the current natural language text;
outputting corresponding system behaviors according to the current conversation state;
converting the system behavior into natural language text or voice to form a round of conversation;
waiting for the next round of voice input by the user to carry out the next round of conversation;
the multi-turn dialog device is any one of the above multi-turn dialog devices.
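The steps above can be sketched as one dialogue turn. The device interface (`predict_state`, `action_for_state`) and the toy `EchoDevice` are hypothetical stand-ins, and the speech-to-text and text-to-speech conversions are omitted:

```python
class EchoDevice:
    """Toy stand-in for the multi-turn dialogue device (hypothetical)."""

    def predict_state(self, history_state, user_text):
        # Current state = history extended with the new utterance.
        return history_state + [user_text]

    def action_for_state(self, state):
        # Output a system behavior for the current dialogue state.
        return f"answer to: {state[-1]}"

def run_dialogue_turn(device, history_state, user_text):
    """One turn: predict the current state from the history and the
    current natural language text, then render the system behavior as
    a natural-language reply (speech conversion omitted)."""
    current_state = device.predict_state(history_state, user_text)
    system_action = device.action_for_state(current_state)
    reply = f"[system] {system_action}"   # NLG: behavior -> text
    return current_state, reply

state, reply = run_dialogue_turn(EchoDevice(), [], "reset my password")
```

The device then waits for the next user input and repeats the loop with the updated state.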
The invention also discloses a multi-turn dialogue method, which is applied to a multi-turn dialogue device and comprises the following steps:
receiving a natural language text input by a current user;
inputting the historical dialogue state and the current natural language text information into the multi-turn dialogue device;
predicting the current conversation state by the multi-turn conversation device according to the historical conversation state and the current natural language text;
outputting corresponding system behaviors according to the current conversation state;
converting the system behavior into natural language text or voice to form a round of conversation;
waiting for the natural language text input by the user in the next round to carry out the next round of conversation;
the multi-turn dialog device is any one of the above multi-turn dialog devices.
The present invention also discloses an electronic device, comprising: a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the multi-turn dialogue method.
The invention also discloses a storage medium, wherein a computer program is stored on the storage medium, and the computer program, when executed by a processor, performs the multi-turn dialogue method.
In practical applications, the modules in the method and system disclosed by the invention can be deployed on one target server, or each module can be deployed on different target servers independently, and particularly, the modules can be deployed on cluster target servers according to needs in order to provide stronger computing processing capacity.
Therefore, even when the sample size is not large, the multi-turn dialogue device can learn good context features, so that the user's questions can be predicted more accurately and answers provided. The network structure is simple, realizing a lightweight, memory-saving and energy-efficient model that can be trained on a small dialogue corpus, further improving the natural language understanding capability of the robot.
In order that the invention may be more clearly and fully understood, specific embodiments thereof are described in detail below with reference to the accompanying drawings.
Detailed Description
Referring to fig. 1, fig. 1 shows a schematic structural diagram of a multi-turn dialog device,
In the multi-turn dialogue device, the context, the current question and the corresponding reply content of a historical multi-turn dialogue are each fed into a shared Transformer encoder to obtain representations of the context, the question and the answer; the representations of the context and the current question are then fused to obtain fusion features; the fusion features are then interacted with the answer representation to obtain a similarity matrix; and finally the turn number of the next dialogue turn is predicted from the similarity matrix to construct the objective function.
The multi-turn dialog device constructed as described above can learn good contextual characteristics even under a condition that the sample size is not large, thereby being able to predict the user's question and provide an answer more accurately.
As an implementation manner, the multi-turn dialogue device in the embodiment of the present application includes a data processing module, a representation module, a feature extraction module, a question-answer feature similarity module, and an objective function module, where:
The data processing module splits and parses the multi-turn conversation data of historical chats into the context (namely the historical conversation content), the human questions and the robot replies, thereby constructing the context dialogue text data, the question data and the answer data, which serve as the input data of the model or are used for training the model.
The human questions refer to questions posed by a user to the chat robot or intelligent customer-service question-answering system, and the answer data refer to the answers given by the chat robot or intelligent customer-service question-answering system in reply to the questions posed by the user.
The representation module maps or converts the input data, namely the context dialogue text data, the question data and the answer data, into a sentence vector set.
As a preferred embodiment, this is realized as follows:
Each sentence of the context dialogue text data, the question data and the answer data is first segmented into words; the position of each word is then represented by an ID, and the ID of each word is represented by an N-dimensional (for example, 512-dimensional) random vector, thereby constructing a sentence vector set of the context dialogue text data, the question data and the answer data, which can then be input into the feature extraction module for feature extraction.
In this embodiment, the position of each word is represented by an ID so that, in the multi-turn dialogue device of the present application, words can be selected for masked predictive training during pre-training. In a specific implementation, a probability scheme can be designed so that each ID is masked (replaced) with a certain probability or selected at random, and the words before and after a masked word are then used to predict what the masked word is.
In addition, the turn number of the current conversation and the masked words are used as sentence-level and word-level labels, respectively.
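A minimal sketch of this masking scheme; the masking probability, the mask marker, and the fixed random seed are illustrative choices, not fixed by the present application:

```python
import random

def mask_ids(token_ids, mask_id=-1, p=0.15, seed=0):
    """Randomly occlude word IDs for masked predictive pre-training.

    Each ID is replaced by `mask_id` with probability `p`; the original
    IDs at occluded positions become the word-level labels.
    """
    rng = random.Random(seed)
    masked, labels = [], {}
    for pos, tid in enumerate(token_ids):
        if rng.random() < p:
            labels[pos] = tid       # remember the occluded word
            masked.append(mask_id)
        else:
            masked.append(tid)
    return masked, labels

ids = [10, 11, 12, 13, 14]
masked, labels = mask_ids(ids, p=0.5)
```

The model is then trained to recover `labels` from the surrounding unmasked IDs.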
The feature extraction module is used for analyzing the sentence vector set to obtain the context feature vector, the question feature vector and the answer feature vector.
As a preferred implementation, the feature extraction module is formed by stacking a plurality of dual-encoder modules: a shared dual encoder with 4 layers is adopted, each layer uses a self-attention mechanism, and each dual encoder consists of a self-attention layer, a normalization layer, a feedforward neural network layer and a normalization layer connected in sequence.
In this embodiment, the feature extraction module uses a self-attention layer based on the self-attention mechanism as part of its technical solution. The self-attention mechanism can fully consider the semantic and grammatical relations between different words in a sentence, and the word vectors obtained by such calculation further take the relations between contexts into account. For example, in the sentence "the bird can fly because it has wings", the machine can link "it" with "bird"; in a multi-turn dialogue system, the semantics of the context can thus be better understood.
A normalization layer is provided after both the self-attention layer and the feedforward neural network layer. The advantage of normalization is that it distributes the features within a smaller, controllable value range and reduces the search range, which not only accelerates training but also improves its stability, allowing the multi-turn dialogue device of the present application to converge faster with higher accuracy. The present application can thereby realize a lightweight design, and the multi-turn dialogue device can understand and predict the context well even when the sample size is not large.
In a more preferred embodiment, normalization is performed after the output vector of a sub-layer is added to its input vector via a residual connection. In this preferred embodiment, the technical effect of the residual connection is to prevent information loss when the number of network layers is large, further improving the accuracy of the present application.
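The dual-encoder block described above (self-attention, normalization, feedforward, normalization, with a residual connection before each normalization) can be sketched as follows. The weight-free attention and feedforward stand-ins are simplifications for illustration, with no learned parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def self_attention(x):
    # Unparameterised self-attention: similarity scores come directly
    # from the inputs (no learned query/key/value projections).
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feed_forward(x):
    # Weight-free stand-in for the feedforward neural network layer.
    return np.maximum(x, 0.0)

def encoder_block(x):
    # Self-attention -> normalization, feedforward -> normalization,
    # each normalization applied after a residual connection.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x))
    return x

def encoder(x, num_layers=4):
    # Stack of dual-encoder blocks (4 shared layers in the embodiment).
    for _ in range(num_layers):
        x = encoder_block(x)
    return x

tokens = np.random.default_rng(0).normal(size=(5, 8))  # 5 tokens, d = 8
features = encoder(tokens)
```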
The question-answer feature similarity module is used for processing the context feature vector, the question feature vector and the answer feature vector to obtain a scoring matrix.
Referring to fig. 2, as a preferred implementation, the question-answer feature similarity module in the embodiment of the present application fuses the context feature vectors and the question feature vectors, that is, performs splicing and summation, so as to obtain richer semantic features, namely the global features of the dialogue context and the local features of the current question.
Matrix multiplication is then performed between the answer feature vector and the spliced context-and-question features to obtain a scoring matrix. The purpose of the scoring matrix is to bring the combined features formed from the spliced context and question features closer in distribution to the answer features, thereby improving the representation capability of the multi-turn dialogue device; the combined features also serve as input features of the cross-entropy loss function with dynamic turn-number labels.
The objective function module is used for setting an objective function suitable for the multi-turn dialogue device according to the scoring matrix. In a preferred embodiment, the objective function module of the present application obtains the objective function by using softmax as the activation function and cross entropy as the loss function.
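A minimal sketch of this objective for a single row of the scoring matrix, using softmax as the activation and cross entropy as the loss; the example scores and target index are arbitrary:

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax activation.
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy_loss(scores, target_index):
    """Objective for one row of the scoring matrix: softmax over the
    candidate-answer scores, cross entropy against the correct index."""
    probs = softmax(scores)
    return -np.log(probs[target_index])

scores = np.array([2.0, 3.0, 0.5])     # arbitrary example scores
loss = cross_entropy_loss(scores, target_index=1)
```

Minimizing this loss drives the score of the correct answer above the scores of the other candidates.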
Based on the multi-turn dialogue device of the above embodiment, the present application also discloses a multi-turn dialogue method, which includes the steps of:
converting the current input sound of the user into a natural language text;
inputting the historical dialogue state and the current natural language text into the multi-turn dialogue device;
predicting the current conversation state by the multi-turn conversation device according to the historical conversation state and the current natural language text;
outputting corresponding system behaviors according to the current conversation state;
converting the system behavior into natural language text or voice to form a round of conversation;
waiting for the next round of voice input by the user to carry out the next round of conversation;
the multi-turn dialog device used is the multi-turn dialog device of the above-described embodiment.
In addition, based on the above embodiment, a variation of the multi-turn dialog method includes:
receiving a natural language text input by a current user;
inputting the historical dialogue state and the current natural language text information into the multi-turn dialogue device;
predicting the current conversation state by the multi-turn conversation device according to the historical conversation state and the current natural language text;
outputting corresponding system behaviors according to the current conversation state;
converting the system behavior into natural language text or voice to form a round of conversation;
waiting for the natural language text input by the user in the next round to carry out the next round of conversation;
the multi-turn dialog device used is the multi-turn dialog device of the above-described embodiment.
The present application further provides an electronic device, comprising a processor, a storage medium and a bus, wherein the storage medium stores machine-readable instructions executable by the processor; when the electronic device runs, the processor communicates with the storage medium through the bus, and the processor executes the machine-readable instructions to perform the method of the above embodiments.
The present application also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the method as described in the above embodiments.
It should be noted that, all or part of the steps in the methods of the above embodiments may be implemented by hardware related to instructions of a computer program, which may be stored in a computer-readable storage medium, which may include, but is not limited to: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.