CN110209791B

CN110209791B - Multi-round dialogue intelligent voice interaction system and device

Info

Publication number: CN110209791B
Application number: CN201910505280.4A
Authority: CN
Inventors: 张韶峰; 冯鑫; 王世朋
Original assignee: Bairong Yunchuang Technology Co ltd
Current assignee: Bairong Yunchuang Technology Co ltd
Priority date: 2019-06-12
Filing date: 2019-06-12
Publication date: 2021-03-26
Anticipated expiration: 2039-06-12
Also published as: CN110209791A

Abstract

A multi-round dialogue intelligent voice interaction system and a device thereof are provided. The system comprises a hybrid semantic understanding module, a semantic understanding adaptive module and an automatic dialogue management module. Voice input is converted into text by speech recognition and then fed into the hybrid semantic understanding module, which understands the user intention and extracts the corresponding state information. The automatic dialogue management module guides the dialogue process based on the user intention, and the output dialogue text is converted into voice output to realize the dialogue. The semantic understanding adaptive module is used for optimizing and training the hybrid semantic understanding module. The invention integrates speech recognition, natural language understanding, natural language generation, speech synthesis, dialogue management and other modules into a complete multi-round dialogue intelligent voice interaction system that is easy to extend, configurable, and applicable to any scenario.

Description

Multi-round dialogue intelligent voice interaction system and device

Technical Field

The invention belongs to the technical field of computers, relates to the fields of natural language processing and artificial intelligence, and discloses a multi-turn dialogue intelligent voice interaction system.

Background

How to process human language with computers has long been a popular research direction in artificial intelligence and natural language processing, and it is one of the core problems to be solved in modern artificial intelligence. Although applications of speech recognition and image recognition are becoming increasingly mature, and deep learning is being studied intensively for semantic understanding, there are still few artificial intelligence products capable of real spoken dialogue. Most voice robots on the market are voice assistants that rely on keyword matching; they can recognize and understand only simple content, have difficulty sustaining continuous multi-round interaction, and frequently give irrelevant answers.

Several prior art intelligent voice interaction schemes are described below.

(I) First technical scheme: speech recognition (ASR) + text matching

Speech recognition plus text matching is the most traditional way to implement an intelligent voice interaction system. Because it is easy to implement and depends little on data, it is often used in traditional call centers. Text matching usually uses exact matching, such as testing strings for equality, or fuzzy matching, such as wildcards in regular expressions, to extract keywords from the speech recognition transcript and dispatch instructions based on those keywords, thereby achieving a degree of spoken language understanding.
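The keyword-to-instruction dispatch described above can be sketched as follows. This is only a minimal illustration: the rule table, the instruction names (`CONFIRM`, `PROMISE_TO_PAY`, `REFUSE`) and the function `match_instruction` are hypothetical and do not come from the patent.

```python
import re

# Hypothetical rule table: each compiled pattern maps to an instruction name.
RULES = [
    (re.compile(r"^(yes|ok|sure)$"), "CONFIRM"),                            # exact match
    (re.compile(r"(repay|pay back).*(today|tomorrow)"), "PROMISE_TO_PAY"),  # wildcard match
    (re.compile(r"no money|cannot pay"), "REFUSE"),                         # keyword match
]


def match_instruction(transcript: str) -> str:
    """Return the instruction of the first matching rule, or "UNKNOWN"."""
    text = transcript.strip().lower()
    for pattern, instruction in RULES:
        if pattern.search(text):
            return instruction
    return "UNKNOWN"
```

Because the first matching rule wins, the order of the rules matters, which hints at why hand-written grammars become hard to maintain as they grow.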

However, this scheme has the following problems: 1) speech recognition accuracy depends heavily on the dialogue domain, and training a speech recognition system for a specific domain is costly; 2) spoken dialogue differs from text dialogue, and a single sentence often carries multiple or even contradictory intentions; and 3) as the interaction content becomes more complex, the hand-written matching grammars quickly grow until they are difficult to maintain and frequently conflict with one another. The semantic recognition accuracy of such a voice interaction system is therefore severely limited.

(II) Second technical scheme: speech recognition (ASR) + intent recognition + semantic slots

Intent recognition combined with semantic slot extraction is another common approach to intelligent voice interaction. Dialogue text data are collected in advance and then labeled and classified. For example, "What is the weather in Beijing tomorrow?" is labeled as "tomorrow/TIME Beijing/LOC weather/INT how/B". A classification model is then trained on a large amount of labeled data to recognize intentions; common intent recognition models include SVM models and deep learning models such as CNNs and RNNs. Semantic slots are extracted according to the intention; common extraction methods include syntactic analysis, named entity extraction, and sequence-to-sequence models. In the example above, the intention is "query the weather", and the extractable semantic slots are "time (TIME)" and "location (LOC)". Intent recognition with semantic slot extraction improves semantic recognition accuracy and the multi-round dialogue capability of the robot to a certain extent. However, this method 1) relies on a large amount of manually labeled data, whose quality and quantity determine the quality of the final interaction, and manual labeling is time-consuming and hard to complete quickly; and 2) supports only simple interactions of one or two rounds, with the interaction content limited by the content of the semantic slots. The scheme is therefore limited in how quickly it can be extended; it can only be applied to fairly mature business scenarios with well-defined slot values and is difficult to adapt to scenarios whose content develops and changes rapidly.
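To make the labeling format concrete, the sketch below parses a string in the `token/TAG` style of the weather example and collects the slot values. The function names and the treatment of `TIME` and `LOC` as the only slot tags are assumptions made for illustration; a real system would predict the tags with a trained model rather than read them from a string.

```python
from typing import Dict, List, Tuple

# Assumption for this sketch: only TIME and LOC tags are treated as semantic slots.
SLOT_TAGS = {"TIME", "LOC"}


def parse_annotation(annotated: str) -> List[Tuple[str, str]]:
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    pairs = []
    for item in annotated.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def extract_slots(pairs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Keep only the tokens whose tag names a semantic slot."""
    return {tag: token for token, tag in pairs if tag in SLOT_TAGS}


pairs = parse_annotation("tomorrow/TIME Beijing/LOC weather/INT how/B")
slots = extract_slots(pairs)  # slot name -> slot value
```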

(III) Third technical scheme: speech recognition (ASR) + sequence-to-sequence learning (Seq2Seq) + text-to-speech (TTS)

The combination of speech recognition, a sequence model and speech synthesis is the so-called end-to-end model. It trains a deep learning Seq2Seq model on the large amount of text data generated on the internet to predict the output text directly from the input text, and then converts that text into speech through speech synthesis.

Predicting the output text directly from the input text with an end-to-end model appears to improve the robot's interaction capability. However, at the current state of the art, this interaction mode considers only the currently input text and not its semantics, so the robot loses the ability to conduct multi-round dialogue and can only be used for chit-chat. Moreover, because the content of the interaction is determined entirely by the model, the interaction is no longer controllable. In most cases the model merely memorizes the training data to some extent and develops no intelligence of its own for judging the dialogue content, so the dialogue logic becomes disordered and the answers are irrelevant. This scheme is therefore better suited to entertainment and is difficult to apply in production.

Disclosure of Invention

The problems to be solved by the invention are as follows. Existing intelligent speech recognition technology uses a single recognition mode, performs poorly on spoken language that lacks rigorous grammar, and cannot recognize complex speech content; or it requires a large amount of manual labeling during training, lacks adaptive capability, and needs manual intervention to update its data. Some schemes appear to support multi-round dialogue but do not actually recognize the logical relationships between turns, so the result is merely a stack of single-round dialogues. To address these shortcomings of the prior art, the invention provides an intelligent voice dialogue system with multi-round dialogue.

The technical scheme of the invention is as follows: a multi-round dialogue intelligent voice interaction system comprises a hybrid semantic understanding module, a semantic understanding adaptive module and an automatic dialogue management module, wherein voice input is converted into text by speech recognition and then fed into the hybrid semantic understanding module, which understands the user intention and extracts the corresponding state information; the automatic dialogue management module guides the dialogue process based on the user intention, and the output dialogue text is converted into voice output to realize the dialogue; the semantic understanding adaptive module is used for optimizing and training the hybrid semantic understanding module;

the hybrid semantic understanding module judges the dialogue semantics comprehensively by model fusion, combining semantic understanding schemes including text matching, semantic similarity matching, information retrieval and a multi-intent classification model, wherein text matching serves as a pre-processing algorithm that preprocesses sentences to obtain the preprocessed dialogue text, and the final result is produced jointly by fusing semantic similarity matching, information retrieval and the multi-intent classification model;

the semantic understanding adaptive module optimizes the existing hybrid semantic understanding models by transfer learning and retraining, the models including a Bi-LSTM language model, a similarity matching model and a multi-intent classification model;

the automatic dialogue management module realizes human-computer interaction control and extensible configuration, and determines the interaction instruction to output by jointly considering the state of the current dialogue, the recognized current user intention and the historical dialogue interaction information; through multi-intent recognition, multiple interactions and state transitions can be realized within a single human-machine exchange.

Furthermore, in the hybrid semantic understanding module, text matching is used to establish semantic understanding rules. Semantic similarity matching uses an attention-based Bi-LSTM neural network language model to build a semantic matching model, which produces vectorized representations of the input dialogue text in combination with the semantic understanding rules; the semantic matching model is fine-tuned using Siamese network training, and finally a convolutional neural network regression model is trained on the vectorized representations to predict the semantic similarity between two texts. Information retrieval uses a standard corpus database that maps standard corpus entries to their intention categories: based on semantic similarity, the entry most similar to the dialogue text is retrieved from the standard corpus, and its intention category is taken as the intention classification of the dialogue text, thereby recognizing the intention of the input text. The multi-intent classification model combines the standard corpus data with business data from the application scenario to generate labeled data for multiple intention categories, trains an attention-based Bi-LSTM network as the multi-intent classification model, performs multi-class prediction, and provides intention categories for the standard corpus.

Further, the semantic understanding adaptive module is used for optimizing the model in the hybrid semantic understanding module, and comprises the following optimizations:

1) importing the newly added corpus data into the training data of the Bi-LSTM semantic matching model, training the model, and updating its network weights;

2) cleaning and filtering the newly added labeled data of the multi-intent classification model, selecting the best portion of the labeled data and mixing it into the standard corpus, re-running intention prediction on all labeled data, establishing a supervised ranking model and a corresponding metric monitoring mechanism according to the text features of the corpus and its performance on the labeled data set, and monitoring the accuracy and recall of each intention as well as changes in the labeled corpus;

3) importing the newly added labeled data into the Bi-LSTM multi-intent classification model, updating the network weights of the model, and monitoring the accuracy and recall of the model on a validation set;

4) automatically deploying online the updated models of the hybrid semantic understanding module together with the standard corpus.

Furthermore, the automatic dialogue management module combines a finite state machine with reinforcement learning. When each round of dialogue interaction is executed, the state of the current dialogue, the current user intention and the historical interaction information are judged comprehensively against the preset dialogue interaction rules and the interaction strategies obtained through learning, and the interactive operation instruction to be executed by the dialogue robot is output.

The invention also provides a multi-turn dialogue intelligent voice interaction device which is a computer device with a storage medium, wherein a computer program is loaded in the storage medium and is used for realizing the multi-turn dialogue intelligent voice interaction system.

The invention integrates speech recognition, natural language understanding, natural language generation, speech synthesis, dialogue management and other modules into a complete multi-round dialogue intelligent voice interaction system that is easy to extend, configurable, and applicable to any scenario. For natural language understanding, the invention introduces a hybrid semantic understanding model: building on in-depth mining of spoken language patterns and a summary of semantic categories, it combines traditional natural language processing techniques with deep neural network algorithms to understand spoken semantics in real time, enabling smooth human-machine communication and improving the interaction experience. In addition, to minimize the dependence of model optimization on manual labeling, a dedicated natural language understanding adaptive module automatically updates the model parameters so that the model can update and optimize itself.

Drawings

FIG. 1 is a system architecture diagram of an embodiment of the present invention.

FIG. 2 is a schematic diagram of the recognition process according to an embodiment of the present invention.

FIG. 3 is a flow chart of a multi-turn dialog according to an embodiment of the present invention.

Detailed Description

The invention provides a multi-round dialogue intelligent voice interaction system comprising a hybrid semantic understanding module, a semantic understanding adaptive module and an automatic dialogue management module. Voice input is converted into text by speech recognition and then fed into the hybrid semantic understanding module, which understands the user intention and extracts the corresponding state information. The automatic dialogue management module guides the dialogue process based on the user intention, and the output dialogue text is converted into voice output to realize the dialogue. The semantic understanding adaptive module is used for optimizing and training the hybrid semantic understanding module.

FIG. 1 is a schematic diagram of the system structure of a specific implementation of the invention. The ASR module is the speech recognition module, which transcribes the speech collected on the user side into query text. The NLU module is the semantic understanding module, namely the hybrid semantic understanding module, which understands the user intention and extracts the corresponding information. The adaptive module is the semantic understanding adaptive module, which assists the self-updating of the NLU module. The DM module is the automatic dialogue management module, which guides the dialogue process based on the user intention. The NLG module is the text generation module, which generates the text for voice output based on the user intention and information extracted from the knowledge base. The TTS module is the speech synthesis module, which converts the information to be output into the corresponding speech.

The implementation of the various modules of the present invention is described in detail below.

Hybrid semantic understanding module

The hybrid semantic understanding module uses model fusion to combine semantic understanding schemes such as text matching, semantic similarity matching, information retrieval and intent classification, judging the dialogue semantics comprehensively and greatly increasing the accuracy and flexibility of semantic understanding. Text matching is used to establish semantic understanding rules. Semantic similarity matching uses an attention-based Bi-LSTM neural network language model to build a semantic matching model, which produces vectorized representations of the input dialogue text in combination with the semantic understanding rules; the semantic matching model is fine-tuned using Siamese network training, and finally a convolutional neural network regression model is trained on the vectorized representations to predict the semantic similarity between two texts. Information retrieval uses a standard corpus database that maps standard corpus entries to their intention categories: based on semantic similarity, the entry most similar to the dialogue text is retrieved from the standard corpus, and its intention category is taken as the intention classification of the dialogue text, thereby recognizing the intention of the input text. The multi-intent classification model combines the standard corpus data with business data from the application scenario to generate labeled data for multiple intention categories, trains an attention-based Bi-LSTM network as the multi-intent classification model, performs multi-class prediction, and provides intention categories for the standard corpus.
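The patent does not spell out how the component outputs are fused. One simple possibility, shown here purely as an assumption, is confidence-weighted voting: text matching runs first as a preprocessing step, each remaining component returns an intent with a confidence, and the intent with the highest weighted total wins. The names `fuse`, `Prediction` and `Component` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

Prediction = Tuple[str, float]            # (intent label, confidence in [0, 1])
Component = Callable[[str], Prediction]   # e.g. similarity matching, retrieval, classifier


def fuse(
    text: str,
    preprocess: Callable[[str], str],
    components: Dict[str, Component],
    weights: Dict[str, float],
) -> str:
    """Preprocess the text, then pick the intent with the highest weighted confidence."""
    cleaned = preprocess(text)  # stands in for the text-matching pre-algorithm
    scores: Dict[str, float] = defaultdict(float)
    for name, component in components.items():
        intent, confidence = component(cleaned)
        scores[intent] += weights[name] * confidence
    return max(scores, key=scores.get)
```

The weights could be tuned per business scenario; equal weights reduce this to a plain confidence sum.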

1) Text matching

Text matching combines traditional natural language processing techniques, such as keyword extraction, syntactic analysis and named entity recognition, to group the common words of the business scenario into fine-grained feature words. Semantic understanding rules are then built on this vocabulary classification. One such grammar rule is:

[ today ] [ can ] [ repayment ]

This rule contains three fine-grained feature words: [today], [can] and [repayment]. Each feature word covers its corresponding common expressions; for example, the feature word [today] covers "today", "this afternoon", "in a moment", "right away", "in a little while" and so on. Feature words can be classified at a finer granularity according to the business scenario.

Feature words can be extracted automatically from text through keyword extraction, syntactic analysis and named entity recognition, or certain words can be configured manually as feature words and extracted by a string processing algorithm, which makes the approach very flexible. Because text matching rules can be configured quickly and flexibly without large amounts of manual labeling, a semantic understanding model can be built quickly from scratch, which solves the cold-start problem of a new business scenario.
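A manually configured rule of this kind can be sketched as a lookup from feature words to their common expressions. The English expression lists and the function `matches_rule` are illustrative assumptions, and the check is order-insensitive substring matching, which is far cruder than the keyword extraction and syntactic analysis mentioned above.

```python
from typing import Dict, List

# Illustrative expression lists; a real deployment would configure these per scenario.
FEATURE_WORDS: Dict[str, List[str]] = {
    "today": ["today", "this afternoon", "in a moment"],
    "can": ["can", "able to"],
    "repayment": ["repay", "pay back", "make the payment"],
}

# The grammar rule [ today ] [ can ] [ repayment ] as a list of feature words.
RULE = ["today", "can", "repayment"]


def matches_rule(text: str, rule: List[str]) -> bool:
    """True if every feature word in the rule has at least one expression in the text."""
    lowered = text.lower()
    return all(
        any(expression in lowered for expression in FEATURE_WORDS[feature])
        for feature in rule
    )
```

Adding a new expression only requires editing the table, which is what makes this approach workable before any labeled data exist.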

2) Semantic similarity matching model based on attention mechanism

Chinese spoken dialogue is characterized by loose grammatical structure, varied modes of expression, and prominent core words. An attention mechanism is a neural network mechanism that, during model training, places more weight on the words that determine the semantics of a text and, to a certain extent, ignores common filler words, connectives and the like. Therefore, to mine the semantic information of short texts as effectively as possible, the invention trains an attention-based Bi-LSTM neural network language model (Attention-based Bi-LSTM Neural Language Model), builds a semantic matching model, and produces vectorized representations of the input text. In addition, to overcome the high cost of manually labeled data and the small size of the labeled data set, the invention fine-tunes the Bi-LSTM semantic matching model using Siamese network (Siamese Network) training, so that the model adapts to the business scenario as far as possible and generates word vector representations better suited to it.
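The effect of attention weighting can be shown without a full network. In the sketch below, each token already has a hidden-state vector and a raw attention score (in a real model both would come from the trained Bi-LSTM and attention layer); the sentence vector is the softmax-weighted sum of the hidden states. The function names and toy inputs are assumptions for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw attention scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attention_pool(hidden_states: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted sum of per-token hidden states, using softmax attention weights."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [
        sum(weight * state[i] for weight, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
```

A token with a much higher score dominates the pooled vector, which is how core words outweigh filler words in the sentence representation.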

In the similarity annotation data, each record consists of two sentences separated by a comma together with a manually annotated similarity grade from 1 to 5, where 1 means the texts are least similar and 5 means they are fully consistent. Using the vectorized representations from the semantic matching model, the invention trains a convolutional neural network (CNN) regression model to predict the semantic similarity between two texts. When the matching score of two texts exceeds a preset, dynamically adjustable threshold, the two texts are considered to match in content, that is, they are judged to be synonymous, and the intention of the input text is recognized based on this semantic similarity.
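The threshold decision can be sketched as follows. Rescaling the 1-5 grade onto [0, 1] and the default threshold of 0.75 are assumptions made for this example; the patent states only that the threshold is preset and dynamically adjustable.

```python
def normalize_grade(grade: int) -> float:
    """Map an annotated similarity grade in 1..5 onto [0, 1] (assumed rescaling)."""
    if not 1 <= grade <= 5:
        raise ValueError("grade must be between 1 and 5")
    return (grade - 1) / 4


def is_synonymous(score: float, threshold: float = 0.75) -> bool:
    """Two texts are judged synonymous when the score exceeds the threshold."""
    return score > threshold
```

Passing `threshold` as a parameter keeps it adjustable at run time, for example stricter for compliance-sensitive intents.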

3) Information retrieval

A standard corpus database is established that maps standard corpus entries to their intention categories. The semantic similarity between the dialogue text and each standard corpus entry is computed, and the intention category of the entry with the highest semantic similarity is taken as the intention classification of the dialogue text. While the interaction system is in use, the business data it collects are labeled and added to the standard corpus, which improves the coverage and accuracy of the system's semantic understanding, greatly improves the efficiency of optimizing the voice interaction system, and makes the whole system more intelligent the more it is used.
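The retrieval step can be sketched as a nearest-neighbour lookup over the standard corpus. The token-overlap similarity `jaccard` is only a stand-in for the learned semantic similarity model, and the corpus entries, intent names and function names are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[str, str]  # (standard utterance, intent category)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the learned similarity model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def retrieve_intent(
    text: str,
    corpus: List[Entry],
    similarity: Callable[[str, str], float] = jaccard,
    threshold: float = 0.0,
) -> Optional[Tuple[str, float]]:
    """Return (intent, score) of the most similar entry, or None if nothing clears the threshold."""
    if not corpus:
        return None
    best_text, best_intent = max(corpus, key=lambda entry: similarity(text, entry[0]))
    score = similarity(text, best_text)
    return (best_intent, score) if score > threshold else None
```

Appending newly labeled business utterances to `corpus` extends coverage without retraining anything, which mirrors the supplementing step described above.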

4) Multi-intent classification model

Because spoken Chinese is highly expressive, the same intention can be expressed in many ways, yet the standard corpus cannot be extended without limit. The invention therefore combines the standard corpus data with business data to generate labeled data for multiple intention categories, trains an attention-based Bi-LSTM network as a multi-intent classification model, performs multi-class prediction, and provides intention categories for the standard corpus, thereby supporting information retrieval. Because neural networks extract features automatically, the semantic understanding module depends far less on the standard corpus, and the prediction accuracy of the model improves further.
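One plausible way to turn classifier outputs into several intents per utterance is to keep every intent whose probability clears a threshold. This multi-label reading, the threshold of 0.5 and the function `select_intents` are assumptions for illustration; the per-intent probabilities would come from the trained Bi-LSTM classifier.

```python
from typing import Dict, List


def select_intents(probabilities: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return all intents at or above the threshold, most probable first.

    Falls back to the single most probable intent when none clears the threshold.
    """
    chosen = [intent for intent, p in probabilities.items() if p >= threshold]
    if not chosen:
        chosen = [max(probabilities, key=probabilities.get)]
    return sorted(chosen, key=lambda intent: -probabilities[intent])
```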

Semantic understanding adaptive module

The semantic understanding adaptive module aims to minimize labor cost. It intelligently optimizes the models in the existing hybrid semantic understanding module, including the Bi-LSTM semantic matching model, the similarity matching model and the multi-intent classification model, by transfer learning and retraining, and thereby improves semantic understanding in the existing business scenario.

The input text data are first labeled with intentions, and the module then performs the following steps: 1) the newly added corpus data are imported into the training data of the Bi-LSTM semantic matching model, the model is trained, and its network weights are updated; 2) the newly added labeled data are cleaned and filtered, the best portion is selected and mixed into the standard corpus, and intention prediction is run again on all labeled data; according to the text features of the corpus and its performance on the labeled data set, a supervised ranking model and a corresponding metric monitoring mechanism are established to monitor the accuracy and recall of each intention, changes in the labeled corpus, and so on; 3) the newly added labeled data are imported into the Bi-LSTM multi-intent classification model, the network weights of the model are updated, and the accuracy and recall of the model on a validation set are monitored; 4) the updated models and the labeled corpus are automatically deployed online. In this way the semantic understanding adaptive module automatically updates the word vectors, the text semantic similarity model, the standard corpus, the intent classifier and so on, minimizing manual intervention.
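The monitoring in steps 2) and 3) can be illustrated with a per-intent metric calculation on a validation set. Reading "accuracy" as per-intent precision is an assumption, and the deployment gate `should_deploy` is a hypothetical addition: the patent says the metrics are monitored but does not state how they control automatic deployment.

```python
from collections import Counter
from typing import Dict, List, Tuple

Metrics = Dict[str, Tuple[float, float]]  # intent -> (precision, recall)


def per_intent_metrics(gold: List[str], predicted: List[str]) -> Metrics:
    """Compute precision and recall for each intent from paired labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted intent was wrong
            fn[g] += 1  # gold intent was missed
    metrics: Metrics = {}
    for intent in set(gold) | set(predicted):
        predicted_total = tp[intent] + fp[intent]
        gold_total = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_total if predicted_total else 0.0
        recall = tp[intent] / gold_total if gold_total else 0.0
        metrics[intent] = (precision, recall)
    return metrics


def should_deploy(old: Metrics, new: Metrics, tolerance: float = 0.0) -> bool:
    """Hypothetical gate: deploy only if no intent regresses beyond the tolerance."""
    for intent, (old_p, old_r) in old.items():
        new_p, new_r = new.get(intent, (0.0, 0.0))
        if new_p < old_p - tolerance or new_r < old_r - tolerance:
            return False
    return True
```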

Automatic dialogue management module

The automatic dialogue management module is key to making human-computer interaction controllable and quickly configurable and extensible, and it is an important guarantee that the interaction is smooth and natural. It determines the interaction instruction to output by jointly considering the current dialogue state, the current user intention and the historical interaction information. Through multi-intent recognition, it can realize multiple interactions and state transitions within a single human-machine exchange, which greatly increases the fluency of the interaction.

The automatic dialogue management module combines a finite-state machine (Finite-State Machine) with reinforcement learning. When each round of interaction is executed, the state of the current dialogue, the current user intention and the historical interaction information are passed into the dialogue management module. Combining the preset interaction rules with the interaction strategies obtained through learning, the dialogue management module outputs the interactive operation the robot should execute, such as carrying out the next round of interaction, querying the knowledge base, whether to interrupt the robot's voice broadcast, or whether to execute the default intention.

In addition, in the automatic dialogue management module the user can define the interaction flow of a multi-round dialogue, that is, set the interaction rules, so that the configuration can be extended quickly.
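The rule-based half of the dialogue manager can be sketched as a finite-state machine whose transition table is plain data, so a flow can be reconfigured without code changes. The states, intents and actions below are hypothetical, and the learned (reinforcement learning) strategy is deliberately left out of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Configurable interaction rules: (state, intent) -> (next state, action).
RULES: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("VERIFY_IDENTITY", "CONFIRM"): ("ASK_REPAYMENT", "NEXT_ROUND"),
    ("VERIFY_IDENTITY", "DENY"): ("END", "HANG_UP"),
    ("ASK_REPAYMENT", "PROMISE_TO_PAY"): ("END", "HANG_UP"),
    ("ASK_REPAYMENT", "ASK_AMOUNT"): ("ASK_REPAYMENT", "QUERY_KNOWLEDGE_BASE"),
}
DEFAULT_ACTION = "DEFAULT_INTENT"  # used when no rule matches


@dataclass
class DialogueManager:
    state: str = "VERIFY_IDENTITY"
    history: List[Tuple[str, str, str]] = field(default_factory=list)

    def step(self, intents: List[str]) -> List[str]:
        """Apply every recognized intent of one user turn and return the actions."""
        actions = []
        for intent in intents:
            next_state, action = RULES.get((self.state, intent), (self.state, DEFAULT_ACTION))
            self.history.append((self.state, intent, action))
            self.state = next_state
            actions.append(action)
            if self.state == "END":
                break
        return actions
```

Because `step` accepts a list of intents, a single utterance carrying two intents triggers two state transitions in one exchange.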

The following describes the implementation of the present invention through a specific implementation scenario.

The implementation of the invention is illustrated using communication in financial scenarios as an example. Unlike other traditional service industries, financial scenarios impose very strict requirements on accuracy and compliance, the scenarios are flexible and changeable, and different customer groups have different interaction needs. The requirements on the semantic understanding accuracy, the controllability of the interaction flow, and the flexibility of scenario extension of an intelligent interactive robot are therefore very high, and existing technical schemes have difficulty satisfying all of them at once. In view of these communication characteristics, the invention provides a complete multi-round dialogue intelligent voice interaction system. By connecting all modules, including speech recognition, natural language understanding, natural language generation and speech synthesis, it achieves strict control of the dialogue flow, supports rapid extension of the configuration, copes flexibly with different interaction scenarios, and adapts itself by automatically improving its natural language understanding capability, minimizing manual intervention.

Credit card debt collection is used as a real application scenario of the invention. In this scenario, the intent recognition accuracy of the invention, measured from a purely linguistic perspective (counting only distinct texts, regardless of how often each occurs), exceeds 85 percent, reaching an industry-leading level. As a multi-round dialogue voice interaction system, each round of interaction proceeds as follows: the ASR module first transcribes the user's voice input into text and passes the text to the semantic understanding module; the semantic understanding module processes the input text, extracts features and predicts its intention; the predicted intention is passed to the dialogue management module, which decides the interactive operation the robot should execute next; according to the interaction instruction output by the dialogue management module, operations such as querying the knowledge base (optional), generating text and generating speech are executed; and the speech is transmitted over the network to the client for playback, as shown in FIG. 2.
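One round of this flow can be sketched as a chain of five stages. Each stage is passed in as a plain callable so the sketch stays self-contained; the function `run_turn` and its parameter names are assumptions, and real ASR, NLU, DM, NLG and TTS components would replace the stubs a caller supplies.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    asr: Callable[[bytes], str],   # speech -> text
    nlu: Callable[[str], str],     # text -> intent
    dm: Callable[[str], str],      # intent -> interaction instruction
    nlg: Callable[[str], str],     # instruction -> reply text
    tts: Callable[[str], bytes],   # reply text -> speech
) -> bytes:
    """Run one round of interaction: ASR, NLU, DM, NLG, then TTS."""
    text = asr(audio)
    intent = nlu(text)
    instruction = dm(intent)
    reply = nlg(instruction)
    return tts(reply)
```

Keeping each stage behind a simple callable interface is one way to let individual modules be swapped or updated, such as redeploying the NLU model, without touching the rest of the loop.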

The multi-round dialogue flow of the credit card collection scenario is shown in FIG. 3, where nodes with branching choices are main flow nodes, gray nodes are hang-up nodes that end the flow, and white nodes are the corresponding intentions. The automatic dialogue management module chains the whole human-computer interaction flow together according to intention type, so that the interaction feels seamless to the user.

The invention develops three key modules, namely the hybrid semantic understanding module, the semantic understanding adaptive module and the automatic dialogue management module, which together achieve strict control of multi-round dialogue, improve the system's own natural language understanding capability, and can be configured flexibly to meet different interaction requirements.

Claims

1. A multi-round dialogue intelligent voice interaction system is characterized by comprising a hybrid semantic understanding module, a semantic understanding self-adaptive module and an automatic dialogue management module, wherein voice input is converted into text input through voice recognition and then is input into the hybrid semantic understanding module, user intention is understood and corresponding state information is extracted, the automatic dialogue management module guides a dialogue process based on the user intention, dialogue texts are output and converted into voice output to realize dialogue, and the semantic understanding self-adaptive module is used for optimizing and learning the hybrid semantic understanding module;

the mixed semantic understanding module comprehensively judges conversation semantics by adopting a model fusion mode and combining semantic understanding schemes such as text matching, semantic similarity matching, information retrieval, multi-intention classification models and the like, wherein the text matching belongs to a pre-algorithm, sentences are preprocessed to obtain preprocessed conversation texts, and final results are output together by the semantic similarity matching, the information retrieval and the multi-intention classification model fusion mode; the text matching is used for establishing a semantic understanding rule, the semantic similarity matching adopts a Bi-LSTM neural network language model based on an attention mechanism to establish a semantic matching model, an input dialog text is vectorized and represented by combining the semantic understanding rule, the fine-tune is carried out on the semantic matching model by adopting a twin network training mode, and finally the semantic similarity between the two texts is predicted by training a regression model of a convolutional neural network based on vectorization representation; the information retrieval is implemented through a standard corpus database corresponding to the dialog text and the intention category of the standard corpus, a corpus with the highest semantic similarity with the dialog text is retrieved from a standard corpus based on the semantic similarity, and the intention category is used as the intention classification of the dialog text to realize the intention identification of the input text; the multi-intention classification model combines standard corpus data and service data of application occasions to generate marking data of multi-intention categories, a Bi-LSTM network based on an attention mechanism is trained to serve as the multi-intention classification model, multi-classification prediction is carried out, and intention categories of the standard corpus are provided;

the semantic understanding self-adapting module optimizes the existing hybrid semantic understanding model in a mode of transfer learning and retraining, wherein the hybrid semantic understanding model comprises a Bi-LSTM semantic matching model, a similarity matching model and a multi-intention classification model, and comprises the following optimization steps:

4) automatically deploying a model in the hybrid semantic understanding module after online updating and a standard corpus;

2. The multi-round dialogue intelligent voice interaction system as claimed in claim 1, wherein the automatic dialogue management module adopts a mode of combining a finite state machine and reinforcement learning, and when each round of dialogue interaction is executed, carries out comprehensive judgment on the state of the current dialogue, the current user intention and historical interaction information by combining preset dialogue interaction rules and interaction strategies obtained through learning, and outputs an interactive operation instruction to be executed by the dialogue robot.

3. A multi-turn dialog intelligent voice interaction device, characterized in that the device is a computer device having a storage medium, wherein a computer program is loaded on the storage medium, and the computer program is used for implementing the multi-turn dialog intelligent voice interaction system of claim 1 or 2.