CN114220425A - Chat robot system and conversation method based on voice recognition and Rasa framework - Google Patents

Chat robot system and conversation method based on voice recognition and Rasa framework

Info

Publication number
CN114220425A
CN114220425A (application CN202111301900.6A)
Authority
CN
China
Prior art keywords
voice
unit
information
text information
rasa
Prior art date
Legal status
Pending
Application number
CN202111301900.6A
Other languages
Chinese (zh)
Inventor
李年勇
庄莉
苏江文
宋立华
Current Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Fujian Yirong Information Technology Co Ltd
Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Fujian Yirong Information Technology Co Ltd, Great Power Science and Technology Co of State Grid Information and Telecommunication Co Ltd filed Critical State Grid Information and Telecommunication Co Ltd
Priority to CN202111301900.6A
Publication of CN114220425A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G06F 40/35 - Discourse or dialogue representation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 - Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a chat robot system and a conversation method based on voice recognition and a Rasa framework, wherein the system comprises a voice service module and an intelligent assistant module. The voice service module comprises a voice recognition unit and a voice synthesis unit: the voice recognition unit is used for recognizing input voice information and converting it into text information, and the voice synthesis unit is used for converting received text information into voice information. The intelligent assistant module comprises a language understanding unit and a dialogue management unit: the language understanding unit is used for classifying user intentions and extracting entities from the text information, and the dialogue management unit is used for maintaining and updating the user's dialogue state and action selection, responding to the user's input according to the understanding result of the language understanding unit, and outputting the reply text information. The robot's chat conversation becomes smoother, and the user experience is improved.

Description

Chat robot system and conversation method based on voice recognition and Rasa framework
Technical Field
The invention relates to the field of man-machine conversation, in particular to a chat robot system and a conversation method based on voice recognition and a Rasa framework.
Background
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence and linguistics that focuses on the interaction between computers and human (natural) language. A semantic representation of natural language is obtained through the analysis of grammar, semantics and pragmatics; the purpose is to generate a machine-readable semantic representation of natural language for the chat robot.
With the development of technologies such as artificial intelligence and the Internet, chat robots have been applied in many fields, including telecommunications, tourism, medical care, aviation and finance, with notable results. Artificial intelligence technology has broken through the original technical bottlenecks of the chat robot, and practice has proven that using chat robots can not only save enterprises a large amount of labor cost but also significantly improve working efficiency. Chat robots currently have a wide range of applications, including question answering, virtual assistants, conversation, community management, customer-service robots, medical care and so on. As chat robots become more advanced, they can be applied to more and more scenarios and problems.
However, because the technology is not yet mature enough, chat robots cannot fully converse the way people do. Even though there are many platforms on the market for building one's own chatbot, such as Microsoft's XiaoIce (currently the best at chitchat), Apple's Siri and Amazon's Echo, and they can all hold a chat, the overall experience is as follows:
The chat robot still seems rather dumb: it often fails to understand people and cannot answer their questions, so the actual effect falls far short of public expectations, and many consider chat robots to be of little practical value.
The ability to understand context and meaning across multiple rounds of interaction is poor, resulting in low fluency of the user experience (an industry-wide bottleneck).
Disclosure of Invention
Therefore, there is a need to provide a chat robot system and a conversation method based on voice recognition and a Rasa framework, to solve the problems that existing chat robots have poor understanding ability and that the user experience therefore lacks fluency.
In order to achieve the above object, the inventor provides a chat robot system based on voice recognition and Rasa framework, comprising a voice service module and an intelligent assistant module;
the voice service module comprises a voice recognition unit and a voice synthesis unit, wherein the voice recognition unit is used for recognizing input voice information and converting the input voice information into text information; the voice synthesis unit is used for converting the received text information into voice information;
the intelligent assistant module comprises a language understanding unit and a dialogue management unit, wherein the language understanding unit is based on the Rasa NLU framework and is used for classifying user intentions and extracting entities from the text information;
the dialogue management unit is based on the Rasa Core framework and is used for maintaining and updating the user's dialogue state and action selection, responding to the user's input according to the understanding result of the language understanding unit, and outputting the reply text information.
Preferably, the speech synthesis unit is specifically configured to convert the text information into a phoneme sequence, mark a start-stop time and a frequency variation of each phoneme, and generate the speech information according to the phoneme sequence and the start-stop time and the frequency variation of the phoneme.
In a further refinement, the language understanding unit comprises a word segmentation component, an entity extraction component, a feature extraction component and an intention recognition component;
the word segmentation component is used for segmenting sentences in the input text information into independent words;
the entity extraction component is used for extracting set keywords according to the segmented words;
the feature extraction component is used for extracting the features of the sentences according to the segmented words;
the intent recognition component is for recognizing an intent from the extracted features.
Further preferably, the language understanding unit further comprises an initialization component for initializing the content required for the word segmentation component, the entity extraction component, the feature extraction component and the intention recognition component to work.
In a further refinement, the system comprises a business customization service module, which is used for setting up corresponding business behavior interfaces according to the actual business requirements; the business behavior interfaces include a chat interface, a voice interface and a ticket booking interface.
In a further refinement, the system also comprises a tool management module, which is used for content management, story management, offline training, model management and behavior management.
Another technical solution is also provided: a chat robot conversation method based on voice recognition and a Rasa framework, comprising the following steps:
recognizing the input voice information through a voice recognition unit, and converting the input voice information into text information;
classifying user intentions and extracting entities according to the text information through a language understanding unit;
updating the user's dialogue state and action selection through the dialogue management unit, responding according to the user's intention and the extracted entities, and outputting the reply text information;
the speech synthesis unit converts the corresponding text information into speech information.
In a further refinement, the step in which the voice synthesis unit converts the reply text information into voice information specifically comprises the following:
the speech synthesis unit converts the text information into a phoneme sequence, marks the start-stop time and the frequency change of each phoneme, and generates speech information according to the phoneme sequence, the start-stop time and the frequency change of the phonemes.
Different from the prior art, in the above technical solution the voice information input by a user is converted into text information by the voice recognition unit, and the text information is then understood by the language understanding unit based on the Rasa NLU framework to obtain the user's intention classification and extracted entities; the dialogue management unit then updates the user's dialogue state and action selection, responds to the user's input and outputs the corresponding text information, which is converted into voice information by the voice synthesis unit and output, completing the voice chat with the user. The solution combines two core technologies, voice recognition and the Rasa open-source machine learning framework. The Rasa framework is easy to operate and easy to train, and reuses pattern matching and search methods; moreover, because Rasa NLU provides a pipeline mode and Rasa Core provides complete conversation management, extensibility and the range of application are greatly improved, and the user's intention is read more accurately, so that the robot's chat conversation is smoother and the user experience is improved.
Drawings
Fig. 1 is a schematic structural diagram of a chat robot system based on speech recognition and Rasa framework according to an embodiment;
fig. 2 is a schematic structural diagram of a chat robot system based on speech recognition and Rasa framework according to an embodiment;
FIG. 3 is a schematic flow chart of speech recognition according to an embodiment;
fig. 4 is a flowchart illustrating a chat robot conversation method based on speech recognition and Rasa framework according to an embodiment of the present invention.
Description of reference numerals:
111. speech recognition unit; 112. speech synthesis unit;
121. language understanding unit; 122. dialogue management unit;
130. business customization service module; 131. chat interface; 132. voice interface; 133. ticket booking interface.
Detailed Description
In order to explain the technical contents, structural features, objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
Referring to fig. 1-3, the present embodiment provides a chat robot system based on speech recognition and Rasa framework, including a speech service module and an intelligent assistant module;
the voice service module comprises a voice recognition unit 111 and a voice synthesis unit 112, wherein the voice recognition unit 111 is used for recognizing input voice information and converting the input voice information into text information; the speech synthesis unit 112 is configured to convert the received text information into speech information;
the intelligent assistant comprises a language understanding unit 121 and a dialogue management unit 122, wherein the language understanding unit 121 is based on a Rasa NLU framework, and the language understanding unit 121 is used for classifying user intentions and extracting entities according to text information;
the dialog management unit 122 is based on the Rasa Core framework, and the dialog management unit 122 is configured to update the dialog state and the action selection of the user according to the maintenance, make a response to the input of the user according to the understanding result of the speech understanding unit, and output the replied text information.
Referring to fig. 3, speech recognition (Automatic Speech Recognition, ASR) is a technology that takes speech as its research object and, through a process of recognition and understanding, lets a machine convert a speech signal into the corresponding text or command. Speech recognition is a very broad cross-discipline, closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, neurobiology and other fields. The speech recognition unit 111 has three basic units: feature extraction, pattern matching, and a reference pattern library; its basic structure is shown in fig. 3. When speech is input, it is first preprocessed and its features are then extracted, and on this basis the templates required for speech recognition are established. In the recognition process, the computer compares the speech templates stored in it with the features of the input speech signal according to the speech recognition model, and finds a series of optimal templates matching the input speech according to a certain search and matching strategy. Then, according to the definition of the templates, the recognition result can be given by table lookup.
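The template-matching idea above can be sketched as follows. This is a toy illustration only, not the patent's implementation: per-frame energy stands in for real acoustic features (e.g. MFCCs), and plain Euclidean distance stands in for a real search and matching strategy such as DTW or Viterbi decoding; all signals and labels are made up.

```python
import math

def extract_features(signal, frame_len=4):
    # "Feature extraction" stand-in: energy of each fixed-length frame.
    return [sum(x * x for x in signal[i:i + frame_len])
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def recognize(features, template_library):
    # "Pattern matching": return the label of the stored template whose
    # feature sequence is closest to the input features.
    def dist(a, b):
        n = min(len(a), len(b))
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(n)))
    return min(template_library, key=lambda label: dist(features, template_library[label]))

# "Reference pattern library": templates built from example utterances.
library = {
    "yes": extract_features([0.0, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0]),
    "no":  extract_features([0.0, 0.0, 0.0, 0.0, 0.1, 0.9, 0.9, 0.0]),
}
result = recognize(extract_features([0.0, 0.8, 0.95, 0.2, 0.0, 0.0, 0.0, 0.0]), library)
```

The input signal's energy profile is closest to the stored "yes" template, so that label is returned.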
The intelligent assistant module comprises a language understanding module (Rasa NLU) and a dialogue management module (Rasa Core). During a man-machine conversation, the user inputs the corresponding text information (or a voice recognition result); the language understanding module carries out user intention classification, entity extraction and the like, while the dialogue management module tracks the user's dialogue state, selects the corresponding candidate actions according to a certain strategy, and executes the corresponding actions (including replying with information and executing business behaviors).
The voice information input by a user is converted into text information by the voice recognition unit 111, and the text information is then understood by the language understanding unit based on the Rasa NLU framework to obtain the user's intention classification and extracted entities; the dialogue management unit 122 then updates the user's dialogue state and action selection, responds to the user's input and outputs the corresponding text information, which is then converted into voice information by the voice synthesis unit 112 and output, completing the voice chat with the user. The solution combines two core technologies, voice recognition and the Rasa open-source machine learning framework. The Rasa framework is easy to operate and easy to train, and reuses pattern matching and search methods; moreover, because Rasa NLU provides a pipeline mode and Rasa Core provides complete conversation management, extensibility and the range of application are greatly improved, and the user's intention is read more accurately, so that the robot's chat conversation is smoother and the user experience is improved.
In this embodiment, a user may send voice information to the chat robot system through a client, such as a web end or an APP mobile end, and the chat robot may also send voice information of a conversation to the client; the user can also directly input voice information to the chat robot through the voice equipment on the chat robot system, and then the chat robot communicates with the user through the voice equipment of the chat robot.
In this embodiment, the speech synthesis unit 112 is specifically configured to convert the text information into a phoneme sequence, mark the start-stop time and frequency variation of each phoneme, and generate the voice information according to the phoneme sequence and the start-stop times and frequency variations of the phonemes. Speech synthesis, TTS (Text-To-Speech), is generally divided into two steps:
First, text processing: this mainly converts the text information into a phoneme sequence and marks information such as the start-stop time and frequency change of each phoneme. The text information is first segmented and converted into a sentence composed of words, and the resulting sentence is then labeled with information helpful for speech synthesis at the phoneme level (previous phoneme/next phoneme), syllable level (which syllable of the word), word level (part of speech/position in the sentence), and so on.
Second, speech synthesis: this mainly generates the voice information from the phoneme sequence and the labeled start-stop times and frequency changes, using one of three main methods: concatenative synthesis, parametric synthesis and vocal tract simulation.
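The text-processing step above can be sketched as follows. This is a hedged toy example: the word-to-phoneme lexicon and the fixed phoneme duration are invented for illustration, whereas a real system uses grapheme-to-phoneme models and predicts per-phoneme durations and frequency (pitch) contours.

```python
# Hypothetical lexicon mapping words to phoneme symbols (illustrative only).
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def text_to_phoneme_sequence(text, phoneme_dur=0.08):
    """Step 1 (text processing): segment the text into words, map each
    word to phonemes, and mark each phoneme's start-stop time."""
    seq, t = [], 0.0
    for word in text.lower().split():
        for ph in LEXICON.get(word, []):
            seq.append({"phoneme": ph,
                        "start": round(t, 2),
                        "stop": round(t + phoneme_dur, 2)})
            t += phoneme_dur
    return seq

sequence = text_to_phoneme_sequence("hello world")
```

The second step (speech synthesis proper) would then render this timed sequence into a waveform by concatenation, parametric synthesis or vocal tract simulation.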
In this embodiment, Rasa NLU is an open-source natural language processing tool for intent classification, response retrieval and entity extraction in conversational robots. The Rasa NLU framework accomplishes intent recognition through the components it manages. This recognition process is not a single step and requires the cooperation of multiple components. As in a pipeline, each component processes the input data and outputs its result for use by other components or as the final output. In this way, a different processing mode can be chosen for each step. Specifically, the language understanding unit comprises a word segmentation component (tokenizer), an entity extraction component (extractor), a feature extraction component (featurizer) and an intention recognition component (classifier);
the word segmentation component is used for segmenting sentences in the input text information into independent words; Chinese word segmentation is performed with the Jieba tokenizer (JiebaTokenizer).
the entity extraction component is used for extracting the set keywords from the segmented words; entities in the sentence are extracted with the MitieEntityExtractor.
the feature extraction component is used for extracting the features of the sentence from the segmented words; feature extraction is performed with the RegexFeaturizer and MitieFeaturizer.
the intention recognition component is used for recognizing the intent from the extracted features; intent classification is performed with the SklearnIntentClassifier.
The language understanding unit further comprises an initialization component, which is used for initializing the content required for the word segmentation component, the entity extraction component, the feature extraction component and the intention recognition component to work.
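The pipeline idea described above can be sketched as follows. This is not the real Jieba/MITIE/sklearn component chain: each stand-in component is toy keyword logic, and the component, intent and entity names are invented for illustration. What the sketch does show is the pipeline contract, in which each component reads the shared message and adds its output for the components downstream.

```python
def tokenizer(msg):
    # Word segmentation component (whitespace split stands in for Jieba).
    msg["tokens"] = msg["text"].lower().split()

def entity_extractor(msg):
    # Entity extraction component: pick out set keywords (toy city list).
    CITIES = {"beijing", "shanghai"}
    msg["entities"] = [{"entity": "city", "value": t} for t in msg["tokens"] if t in CITIES]

def featurizer(msg):
    # Feature extraction component: bag-of-words feature set.
    msg["features"] = set(msg["tokens"])

def intent_classifier(msg):
    # Intention recognition component: toy rule instead of a trained classifier.
    msg["intent"] = "book_ticket" if {"book", "ticket"} & msg["features"] else "chitchat"

PIPELINE = [tokenizer, entity_extractor, featurizer, intent_classifier]

def parse(text):
    msg = {"text": text}
    for component in PIPELINE:      # each component consumes earlier outputs
        component(msg)
    return msg

result = parse("book a ticket to Shanghai")
```

Swapping any stand-in for another implementation leaves the rest of the chain untouched, which is the point of the pipeline mode.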
In this embodiment, Rasa Core is the dialogue management unit 122 provided by the Rasa framework. It is similar to the brain of the chat robot: its main task is to maintain and update the dialogue state and action selection, and then respond to user input. The dialogue state is a machine-processable representation of the chat data; it contains all information that may influence the next decision, such as the output of the natural language understanding module and the characteristics of the user. Action selection means choosing the appropriate next action based on the current dialogue state, for example asking the user for information that needs to be supplemented, or executing the action the user requested. As a specific example, the user says "help me book an airline ticket", and the dialogue state includes features such as the output of the natural language understanding module, the location of the user and historical behavior. In this state, the next actions of the system may be:
1. Ask the user for the destination city, e.g. "To which city would you like to book a flight?"
2. Confirm the departure city (available from the user's location) and the destination city with the user, e.g. "Do you want to book a ticket from Beijing to Shanghai?"
3. Confirm the departure date with the user, e.g. "For which day would you like to book the ticket?"
4. Confirm the flight with the user, e.g. "Do you want to book Xiamen Airlines flight MF8555?"
5. Book the ticket for the user directly.
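The five options above can be sketched as a single action-selection function over the dialogue state: ask for the next missing slot, then confirm, then book. The slot and action names are illustrative assumptions, not the patent's or Rasa's actual identifiers.

```python
def select_action(state):
    """Pick the next action from the current dialogue state,
    mirroring options 1-5 above."""
    slots = state["slots"]
    if slots.get("destination") is None:
        return "ask_destination_city"        # option 1
    if slots.get("departure") is None:
        return "confirm_departure_city"      # option 2 (may come from user location)
    if slots.get("date") is None:
        return "ask_departure_date"          # option 3
    if not state.get("flight_confirmed"):
        return "confirm_flight"              # option 4
    return "book_ticket"                     # option 5

state = {"slots": {"destination": "Shanghai", "departure": "Beijing", "date": None}}
next_action = select_action(state)
```

Because every decision reads only the state, the same function keeps working as understanding results, user features and history are folded into that state.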
In this embodiment, the system further includes a business customization service module 130, where the business customization service module 130 is configured to set a corresponding business behavior interface according to the actual business requirement, and the business behavior interface includes a chat interface 131, a voice interface 132, and a ticket booking interface 133.
In this embodiment, the system further comprises a tool management module, which is used for content management, story management, offline training, model management and behavior management. The tool management module enables online collection, online labeling and online editing of the training data required by the intelligent assistant module, and online training, publishing and management of the model. It also enables the online integration of the business customization service module 130. The chat robot system can thus be adapted to various business scenarios, operation and maintenance of the system become simple, and the accuracy and recall rate of the human-computer conversation can be continuously improved. The specific implementation is as follows:
(1) content management:
Content management is mainly used for the entry, management and labeling of content in the dialogue system. The content types can be: text, text with images, and files. Its functions include: content classification management; content addition, editing and deletion; and file import.
For the entered content information, the back-end service carries out data preprocessing in the following steps:
step 1: and acquiring the title and the content, and adopting the techniques of Excel analysis, Tika extraction and the like.
Step 2: and (4) cleaning data by adopting technologies such as Html label filtering, regular matching, special character filtering, stop word removing and the like.
And step 3: and (3) title leakage repairing, wherein no title is recorded after the steps 1 and 2, and a seq2seq model is adopted, and the title is automatically generated by using the cleaned text information.
And 4, step 4: manually labeling (this step is not an essential step), the system provides many question templates, manually template-selecting this data, and label setting (default is not set to title), the system automatically generates questions using a template matching method.
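The cleaning and title-completion steps above can be sketched as follows. This is a hedged illustration: the stop-word list is made up, and simple truncation stands in for the seq2seq title-generation model the text describes.

```python
import re

STOP_WORDS = {"the", "a", "an", "of"}  # illustrative stop-word list

def clean_text(raw):
    """Step 2: HTML tag filtering, special-character filtering, stop-word removal."""
    text = re.sub(r"<[^>]+>", " ", raw)                       # HTML tag filtering
    text = re.sub(r"[^0-9A-Za-z\u4e00-\u9fff ]", " ", text)   # special characters
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def ensure_title(title, content, max_words=5):
    """Step 3: if no title survived steps 1-2, generate one from the
    cleaned content (truncation used here as a stand-in for seq2seq)."""
    return title if title else " ".join(content.split()[:max_words])

content = clean_text("<p>How to reset the password of an account?</p>")
title = ensure_title("", content)
```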
(2) Story management:
the story management is mainly used for story scene arrangement of conversation management, so that the system can automatically select a corresponding response strategy according to the intention of a user. The functions include: adding stories, editing on line, deleting and importing.
(3) Online training:
Online training is mainly used for the online training of the Rasa model; once steps (1) and (2) are completed, the system meets the conditions for online training. During online training, the system automatically generates the Rasa training data from the content management data, then starts model training, automatically stores the training result in the models directory, and records the model data. The Rasa training data includes: version, nlu, stories, rules. Specifically:
nlu:
The NLU training data stores structured information about user messages; the goal of NLU (natural language understanding) is to extract such structured information from them. This typically includes the user's intent and any entities their message contains. Additional information, such as regular expressions and lookup tables, can be added to the training data to help the model correctly identify intents and entities.
stories:
Stories are a type of training data used to train the assistant's dialogue management model. Stories can be used to train models that generalize to unseen conversation paths. A story is a representation of a conversation between the user and the AI assistant, converted into a specific format in which the user's input is expressed as an intent (and entities where necessary), while the assistant's responses and actions are expressed as action names.
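An illustrative story in the shape just described: user turns recorded as intents (with entities where needed), assistant turns as action names. It mirrors the structure of Rasa story data but is written here as a plain Python structure, and every intent, entity and action name is invented for the example.

```python
story = {
    "story": "ticket booking happy path",
    "steps": [
        {"intent": "greet"},                                        # user turn
        {"action": "utter_greet"},                                  # assistant turn
        {"intent": "book_ticket", "entities": [{"city": "Shanghai"}]},
        {"action": "action_ask_departure_date"},
        {"intent": "inform", "entities": [{"date": "tomorrow"}]},
        {"action": "action_book_ticket"},
    ],
}

# Separate the two kinds of turn, as a dialogue model's training code would.
user_turns = [s["intent"] for s in story["steps"] if "intent" in s]
bot_turns = [s["action"] for s in story["steps"] if "action" in s]
```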
(4) Model management:
Model management is mainly used for managing the models generated by training, and includes functions such as deleting and publishing models.
(5) And (4) Action management:
Action management is mainly used for managing the action interface information customized by the business module, including: the URL of the calling service, the action name and the action's unique key. Its functions include action registration, editing and deletion.
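The action registry just described can be sketched as a small key-value store holding, per unique key, the action name and the URL of the calling service. The keys, names and URLs below are illustrative assumptions.

```python
actions = {}

def register_action(key, name, url):
    # Registration: store name and calling-service URL under a unique key.
    actions[key] = {"name": name, "url": url}

def edit_action(key, **fields):
    # Editing: update selected fields of a registered action.
    actions[key].update(fields)

def delete_action(key):
    # Deletion: remove the action if present.
    actions.pop(key, None)

register_action("act-001", "action_book_ticket", "http://booking-service.example/api/book")
edit_action("act-001", url="http://booking-service.example/v2/book")
```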
Referring to fig. 4, in another embodiment, a chat robot conversation method based on voice recognition and the Rasa framework is applied to the above chat robot system based on voice recognition and the Rasa framework; the conversation method comprises the following steps:
S310: recognizing the input voice information through the voice recognition unit and converting it into text information;
S320: classifying the user's intention and extracting entities from the text information through the language understanding unit;
S330: updating the user's dialogue state and action selection through the dialogue management unit, responding according to the user's intention and the extracted entities, and outputting the reply text information;
S340: converting the reply text information into voice information through the voice synthesis unit.
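Steps S310 to S340 can be chained as one loop iteration, sketched below with toy stand-ins for each unit; none of this is the real Rasa, ASR or TTS implementation, and all intents, replies and field names are invented for illustration.

```python
def s310_recognize(audio):
    # Voice recognition unit stand-in: assume ASR yields a transcript.
    return audio["transcript"]

def s320_understand(text):
    # Language understanding unit stand-in: toy intent + entity extraction.
    intent = "book_ticket" if "ticket" in text else "chitchat"
    entities = [w for w in text.split() if w.istitle()]
    return {"intent": intent, "entities": entities}

def s330_manage(parsed, state):
    # Dialogue management unit stand-in: update state, choose a reply.
    state["last_intent"] = parsed["intent"]
    if parsed["intent"] == "book_ticket":
        return "To which city would you like to book a flight?"
    return "Hello! How can I help you?"

def s340_synthesize(reply_text):
    # Speech synthesis unit stand-in: pseudo phoneme sequence plus the text.
    return {"text": reply_text, "phonemes": list(reply_text.lower())}

state = {}
audio_in = {"transcript": "I want to book a ticket"}
speech_out = s340_synthesize(s330_manage(s320_understand(s310_recognize(audio_in)), state))
```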
The voice information input by a user is converted into text information by the voice recognition unit, and the text information is then understood by the language understanding unit based on the Rasa NLU framework to obtain the user's intention classification and extracted entities; the dialogue management unit then updates the user's dialogue state and action selection, responds to the user's input and outputs the corresponding text information, which is converted into voice information by the voice synthesis unit and output, completing the voice chat with the user. The solution combines two core technologies, voice recognition and the Rasa open-source machine learning framework. The Rasa framework is easy to operate and easy to train, and reuses pattern matching and search methods; moreover, because Rasa NLU provides a pipeline mode and Rasa Core provides complete conversation management, extensibility and the range of application are greatly improved, and the user's intention is read more accurately, so that the robot's chat conversation is smoother and the user experience is improved.
In this embodiment, the step in which the voice synthesis unit converts the reply text information into voice information specifically comprises the following:
the voice synthesis unit converts the text information into a phoneme sequence, marks the start-stop time and frequency change of each phoneme, and generates the voice information according to the phoneme sequence and the start-stop times and frequency changes of the phonemes.
Speech synthesis, or TTS (Text-To-Speech), generally comprises two steps:
First, text processing: the text information is converted into a phoneme sequence, and each phoneme is annotated with information such as its start and stop times and frequency variation. The text is first segmented into words to form a sentence, and the sentence is then annotated with information helpful for speech synthesis, such as phoneme-level information (previous phoneme / next phoneme), syllable-level information (e.g., the second syllable of a word), and word-level information (part of speech, position in the sentence).
Second, speech synthesis: voice information is generated from the phoneme sequence and the annotated start-stop times and frequency variations, mainly by one of three methods: concatenative (splicing) synthesis, parametric synthesis, or vocal tract simulation.
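The two steps above can be sketched in Python: a text-processing step that produces an annotated phoneme sequence, and a toy concatenative ("splicing") synthesis step that joins stored units in order. The phoneme lexicon, durations, frequencies, and unit bank are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str   # phoneme label
    start: float  # start time in seconds (illustrative)
    stop: float   # stop time in seconds (illustrative)
    freq: float   # fundamental frequency in Hz (illustrative)

def text_to_phonemes(text: str) -> list[Phoneme]:
    # Step 1 (text processing): convert text to a phoneme sequence and
    # annotate each phoneme with start/stop times and a frequency.
    # The toy lexicon stands in for a real grapheme-to-phoneme model.
    toy_lexicon = {"hi": ["h", "ai"]}
    phonemes, t = [], 0.0
    for word in text.lower().split():
        for sym in toy_lexicon.get(word, list(word)):
            phonemes.append(Phoneme(sym, t, t + 0.1, 120.0))
            t += 0.1
    return phonemes

def synthesize(phonemes: list[Phoneme]) -> list[float]:
    # Step 2 (concatenative synthesis): look up a stored waveform
    # snippet per phoneme and splice the snippets in sequence order.
    # Snippets here are single dummy samples, not real audio.
    unit_bank = {"h": [0.1], "ai": [0.5]}
    wave = []
    for p in phonemes:
        wave.extend(unit_bank.get(p.symbol, [0.0]))
    return wave

seq = text_to_phonemes("hi")
print([p.symbol for p in seq])  # phoneme sequence
print(synthesize(seq))          # spliced "waveform"
```

A parametric or vocal-tract-simulation method would replace `synthesize` with a model that generates the waveform from the annotations (duration, frequency) rather than splicing stored units.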
It should be noted that although the above embodiments have been described herein, the invention is not limited thereto. Changes and modifications made to the embodiments described herein based on the innovative concept of the present invention, and equivalent structures or equivalent processes derived from the content of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, all fall within the protection scope of the present invention.

Claims (8)

1. A chat robot system based on voice recognition and a Rasa framework is characterized by comprising a voice service module and an intelligent assistant module;
the voice service module comprises a voice recognition unit and a voice synthesis unit, wherein the voice recognition unit is used for recognizing input voice information and converting the input voice information into text information; the voice synthesis unit is used for converting the received text information into voice information;
the intelligent assistant module comprises a language understanding unit and a dialogue management unit, wherein the language understanding unit is based on the Rasa NLU framework and is used for classifying user intent and extracting entities from the text information;
the dialogue management unit is based on the Rasa Core framework and is used for maintaining and updating the user's dialogue state and action selection, responding to the user's input according to the understanding result of the language understanding unit, and outputting the reply text information.
2. The system of claim 1, wherein the speech synthesis unit is configured to convert the text information into a sequence of phonemes, mark the start-stop time and frequency variation of each phoneme, and generate the speech information according to the sequence of phonemes, the start-stop time and frequency variation of the phoneme.
3. The speech recognition and Rasa framework based chat robot system of claim 1, wherein the language understanding unit comprises a word segmentation component, an entity extraction component, a feature extraction component, and an intent recognition component;
the word segmentation component is used for segmenting sentences in the input text information into independent words;
the entity extraction component is used for extracting preset keywords from the segmented words;
the feature extraction component is used for extracting the features of the sentences according to the segmented words;
the intent recognition component is for recognizing an intent from the extracted features.
4. The speech recognition and Rasa framework based chat bot system of claim 3, wherein the language understanding unit further comprises an initialization component for initializing the content required for the word segmentation component, the entity extraction component, the feature extraction component, and the intent recognition component to work.
5. The chat robot system based on voice recognition and Rasa framework according to claim 1, further comprising a business customization service module, wherein the business customization service module is configured to provide corresponding service behavior interfaces according to actual business requirements, the service behavior interfaces including a chat interface, a voice interface, and a ticket booking interface.
6. The speech recognition and Rasa framework based chat bot system according to claim 1, further comprising a tools management module for content management, story management, offline training, model management, and behavior management.
7. A chat robot dialogue method based on voice recognition and a Rasa framework is characterized by comprising the following steps:
recognizing the input voice information through a voice recognition unit, and converting the input voice information into text information;
classifying user intentions and extracting entities according to the text information through a language understanding unit;
the dialogue management unit updates the user's dialogue state and action selection, responds according to the user's intent and the extracted entities, and outputs the response text information;
the speech synthesis unit converts the response text information into speech information.
8. The chat robot conversation method based on speech recognition and Rasa framework according to claim 7, wherein the step in which the speech synthesis unit converts the response text information into speech information specifically comprises:
the speech synthesis unit converts the text information into a phoneme sequence, marks the start and stop times and the frequency variation of each phoneme, and generates speech information from the phoneme sequence and the marked start-stop times and frequency variations.
CN202111301900.6A 2021-11-04 2021-11-04 Chat robot system and conversation method based on voice recognition and Rasa framework Pending CN114220425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111301900.6A CN114220425A (en) 2021-11-04 2021-11-04 Chat robot system and conversation method based on voice recognition and Rasa framework

Publications (1)

Publication Number Publication Date
CN114220425A true CN114220425A (en) 2022-03-22

Family

ID=80695640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111301900.6A Pending CN114220425A (en) 2021-11-04 2021-11-04 Chat robot system and conversation method based on voice recognition and Rasa framework

Country Status (1)

Country Link
CN (1) CN114220425A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392264A (en) * 2022-10-31 2022-11-25 康佳集团股份有限公司 RASA-based task-type intelligent multi-turn dialogue method and related equipment


Similar Documents

Publication Publication Date Title
CN107993665B (en) Method for determining role of speaker in multi-person conversation scene, intelligent conference method and system
CN109271631B (en) Word segmentation method, device, equipment and storage medium
CN109918650B (en) Interview intelligent robot device capable of automatically generating interview draft and intelligent interview method
CN110717018A (en) Industrial equipment fault maintenance question-answering system based on knowledge graph
CN104050160B (en) Interpreter's method and apparatus that a kind of machine is blended with human translation
CN110765759B (en) Intention recognition method and device
CN111914074A (en) Method and system for generating limited field conversation based on deep learning and knowledge graph
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN114691852A (en) Man-machine conversation system and method
CN115392264A (en) RASA-based task-type intelligent multi-turn dialogue method and related equipment
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN103885924A (en) Field-adaptive automatic open class subtitle generating system and field-adaptive automatic open class subtitle generating method
KR101677859B1 (en) Method for generating system response using knowledgy base and apparatus for performing the method
Rosenberg Speech, prosody, and machines: Nine challenges for prosody research
CN116187320A (en) Training method and related device for intention recognition model
CN116166688A (en) Business data retrieval method, system and processing equipment based on natural language interaction
CN116092472A (en) Speech synthesis method and synthesis system
CN111553157A (en) Entity replacement-based dialog intention identification method
CN114220425A (en) Chat robot system and conversation method based on voice recognition and Rasa framework
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
CN116450799B (en) Intelligent dialogue method and equipment applied to traffic management service
CN115345177A (en) Intention recognition model training method and dialogue method and device
Bangalore et al. Balancing data-driven and rule-based approaches in the context of a multimodal conversational system
CN116129868A (en) Method and system for generating structured photo
CN112506405B (en) Artificial intelligent voice large screen command method based on Internet supervision field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination