CN106409283B - Man-machine mixed interaction system and method based on audio - Google Patents


Info

Publication number
CN106409283B
CN106409283B (application CN201610791966.0A)
Authority
CN
China
Prior art keywords
information, unit, recognition module, module, intervention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610791966.0A
Other languages
Chinese (zh)
Other versions
CN106409283A (en)
Inventor
俞凯
石开宇
郑达
陈露
常成
曹迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201610791966.0A
Publication of CN106409283A
Application granted
Publication of CN106409283B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses an audio-based man-machine hybrid interaction system in which a speech recognition module is connected with a semantic recognition module and transmits the text information corresponding to the user's speech; an exception handling module is connected with both the speech recognition module and the semantic recognition module, the speech recognition module transmitting the text information to it and the semantic recognition module transmitting the semantic parsing result to it; and the exception handling module is connected with a speech synthesis module and transmits intervention information. The invention also discloses an audio-based man-machine hybrid interaction method in which the speech recognition module converts speech information into text information and outputs it to the semantic recognition module; the semantic recognition module extracts the user's purpose and the corresponding key information from the text information; and the exception handling module judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine conversation is abnormal, and replies with an exception handling message. The technical scheme of the invention provides a unified man-machine conversation experience.

Description

Man-machine mixed interaction system and method based on audio
Technical Field
The invention relates to the technical field of information processing, in particular to an audio-based man-machine hybrid interaction system and method.
Background
As shown in fig. 1, current audio-based man-machine dialog systems use the machine's reply as the final reply to the user. When the machine decision system cannot determine the user's intention, most dialog systems present a reply such as "please say that again" and ask the user for new input; some man-machine dialog systems introduce a manual intervention method based on a call center.
At present, man-machine conversation exception handling is mainly implemented through a call center. When the machine cannot handle the user's input audio, or the user explicitly indicates that manual service is needed, a human call center is asked to intervene: a one-to-one voice connection is established between the user and an operator, the operator communicates with the user directly to obtain the user's requirements, and the corresponding instructions are issued through the call-center platform.
The manual intervention mode of the existing call center has three main problems. Low efficiency: the operator must hold a one-to-one voice connection with the user and cannot serve anyone else while waiting for the user's input. High cost: a large-scale call center requires a series of telecommunication devices and the corresponding service integration, and the low efficiency means more operators are needed, which indirectly raises labor cost. Strong dependence on the network environment: transmitting audio directly over the network requires a stable connection; fluctuations in the network environment degrade the audio quality, which hurts the conversation experience and can even interrupt the man-machine conversation.
Therefore, those skilled in the art are working to develop an audio-based man-machine hybrid interaction system and method that combine human intervention replies with machine replies, so as to unify the flow of the man-machine conversation and improve the user experience.
Disclosure of Invention
In view of the above defects of the prior art, the technical problem to be solved by the present invention is how to improve the efficiency of man-machine conversation and the user experience during customer service.
To achieve the above object, the present invention provides an audio-based man-machine hybrid interaction system comprising a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module, wherein the speech recognition module is configured to be connected to the semantic recognition module and transmit the text information corresponding to the speech; the exception handling module is configured to be connected to the speech recognition module and the semantic recognition module, the speech recognition module being configured to transmit the text information to the exception handling module and the semantic recognition module being configured to transmit the semantic parsing result to the exception handling module; and the exception handling module is configured to be connected to the speech synthesis module and transmit intervention information.
Further, the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model and a decoder, wherein the signal processing and feature extraction unit is configured to be connected with the acoustic model and transmit acoustic feature information, and the decoder is configured to be connected with the acoustic model and the language model and output a recognition result.
Further, the speech synthesis module comprises a text analysis unit, a prosody control unit and a synthesized speech unit, wherein the text analysis unit is configured to receive and process text information and transmit the processing result to the prosody control unit and the synthesized speech unit, the prosody control unit is configured to be connected to the synthesized speech unit and transmit pitch, duration, intensity, pause and intonation information, and the synthesized speech unit is configured to synthesize the output speech from the analysis result of the text analysis unit and the control parameters of the prosody control unit.
Further, the semantic recognition module comprises a domain labeling unit, an intention judging unit and an information extraction unit, wherein the domain labeling unit is connected to the intention judging unit and transmits domain information, the intention judging unit is connected to the information extraction unit and transmits user intention information, and the information extraction unit outputs the semantic parsing result.
Further, the exception handling module comprises an abnormality detection unit, a database query unit and an intervener unit, wherein the abnormality detection unit is configured to receive the outputs of the speech recognition module and the semantic recognition module and decide whether to take intervention measures, the database query unit is configured to receive the intervention signal of the abnormality detection unit and the semantic information of the semantic recognition module and to query and output an intervention message, and the intervener unit is configured to let a human intervener perform the necessary selection and modification of the intervention message output by the database query unit and finally output the reply message to the user.
The invention also provides a man-machine mixed interaction method based on the audio, which comprises the following steps:
step 1, providing a voice recognition module, a voice synthesis module, a semantic recognition module and an exception handling module;
step 2, the speech recognition module converts speech information into text information and outputs it to the semantic recognition module;
step 3, the semantic recognition module extracts the user's purpose and the corresponding key information from the text information;
step 4, the exception handling module judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine conversation is abnormal, and replies with an exception handling message.
Further, in step 2, the method specifically comprises the following steps:
step 2.1, extracting features from the input audio stream for processing by the acoustic model, while reducing the influence of environmental noise, channel and speaker factors on the features;
step 2.2, the decoder searches, according to the acoustic model, the language model and the dictionary, for the word string that outputs the audio stream with maximum probability, and takes it as the speech recognition result.
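Step 2.2 is the standard maximum a posteriori decoding rule. Written out (the patent does not spell out the formula), the decoder searches for the word string W* that maximizes the posterior probability of the observed acoustic features O, with the dictionary constraining how word strings expand into the phone sequences scored by the acoustic model:

```latex
W^{*} = \operatorname*{arg\,max}_{W} P(W \mid O)
      = \operatorname*{arg\,max}_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}}\,\underbrace{P(W)}_{\text{language model}}
```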
Further, in step 3, the method specifically comprises the following steps:
step 3.1, marking the domain to which the current conversation belongs by using the symbolic keywords in the text information;
step 3.2, judging the user intention within that domain based on rules;
step 3.3, extracting the specific key information according to the domain and the user intention, in combination with rules.
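To make steps 3.1 to 3.3 concrete, here is a minimal rule-based sketch in Python; the keyword tables and the extraction template are illustrative assumptions, not taken from the patent:

```python
# A minimal sketch of the rule-based pipeline of step 3 (domain labeling ->
# intent judgment -> key-information extraction). The keyword tables and the
# template below are illustrative assumptions, not taken from the patent.
DOMAIN_KEYWORDS = {
    "navigation": ["go to", "navigate"],
    "music": ["play", "song"],
}
INTENT_RULES = {
    "navigation": [("navigate", ["go to", "navigate"])],
    "music": [("play_music", ["play", "song"])],
}

def parse(text):
    # Step 3.1: label the domain via symbolic keywords in the text.
    domain = next((d for d, kws in DOMAIN_KEYWORDS.items()
                   if any(kw in text for kw in kws)), "unknown")
    # Step 3.2: judge the user intention with rules specific to that domain.
    intent = next((name for name, kws in INTENT_RULES.get(domain, [])
                   if any(kw in text for kw in kws)), "unknown")
    # Step 3.3: extract key information with a preset template
    # (here: the text after the trigger phrase is taken as the slot value).
    key_info = {}
    if intent == "navigate" and "go to" in text:
        key_info["destination"] = text.split("go to", 1)[1].strip()
    return domain, intent, key_info

print(parse("i want to go to a playful place"))
# -> ('navigation', 'navigate', {'destination': 'a playful place'})
```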
Further, in step 4, the method specifically comprises the following steps:
step 4.1, the abnormality detection unit judges whether the current man-machine conversation is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module; if it is abnormal, the intervener unit takes over the man-machine conversation;
step 4.2, the database query unit queries the database according to the semantic information and obtains intervention messages with recommendation degrees; if a message's recommendation degree is high, that message is used to intervene directly, and if it is low, an intervener is asked to intervene manually;
step 4.3, when the machine algorithm cannot find an intervention message with a high recommendation degree, the intervener intervenes to select and modify an intervention message, which is then sent to the client.
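A minimal sketch of this routing decision follows; the numeric threshold is an assumption, since the patent only distinguishes "higher" and "lower" recommendation degrees:

```python
# A sketch of the step-4 routing: use the intervention message directly when
# its recommendation degree is high enough, otherwise hand the candidates to
# a human intervener. The 0.8 threshold is an assumption.
RECOMMENDATION_THRESHOLD = 0.8

def route_intervention(candidates, intervener_queue):
    """candidates: list of (message_text, recommendation_degree) tuples."""
    if candidates:
        text, degree = max(candidates, key=lambda c: c[1])
        if degree >= RECOMMENDATION_THRESHOLD:
            return text                    # step 4.2: machine intervenes directly
    intervener_queue.append(candidates)    # step 4.2/4.3: request manual intervention
    return None                            # the reply will come from the intervener
```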
Further, the key information comprises the dialogue domain and dialogue keywords, the dialogue keywords comprising content keywords and emotion keywords.
Compared with the prior art, the invention has the technical effects that:
1. Improved efficiency: the time during which an intervener waits for user input is fully utilized, so one intervener can serve several users at the same time, raising intervention efficiency.
2. Reduced cost: the intervention platform can be built with existing computers and servers, without purchasing the series of telecommunication devices a call center requires.
3. Rich working scenarios: because the intervener interface adopts a B/S (Browser/Server) architecture, the intervener only needs to open a browser and log into the corresponding website to perform intervention operations; there is no need to answer calls at a fixed station, and the intervention service can be performed on mobile terminals such as a tablet (PAD), a smartphone or a personal notebook.
4. Low network requirements: text transmission carries little data, which lowers the demand on the network; at the same time, the speech the user hears is synthesized locally and is not affected by network conditions.
5. Unified man-machine conversation experience: the intervener is transparent to the user, whose experience is that of talking to a sufficiently intelligent "machine" able to seamlessly continue the current man-machine conversation.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
Fig. 1 is a schematic diagram of an intervention mode of a conventional traffic center;
FIG. 2 is a block diagram of a system of the present invention;
FIG. 3 is a system flow diagram of a preferred embodiment of the present invention;
FIG. 4 is a diagram illustrating a role dialog flow according to a preferred embodiment of the present invention.
Detailed Description
The invention is realized by the following technical scheme:
as shown in fig. 2, the present invention relates to an audio-based human-machine dialog exception handling system, which comprises: speech recognition module, speech synthesis module, semantic recognition module and exception handling module, wherein: the voice recognition module is connected with the semantic recognition module and transmits text information corresponding to voice, the voice recognition module and the semantic recognition module are both connected with the exception processing module and respectively transmit the text information and a semantic analysis result, and the exception processing module is connected with the voice synthesis module and transmits intervention information.
The voice recognition module comprises: signal processing and feature extraction unit, acoustic model, language model and decoder, wherein: the signal processing and feature extracting unit is connected with the acoustic model and transmits acoustic feature information, and the decoder is connected with the acoustic model and the language model and outputs a recognition result to the outside.
The speech synthesis module comprises: a text analysis unit, a prosody control unit, and a synthesized speech unit, wherein: the text analysis unit receives and processes the text information, transmits the processing result to the rhythm control unit and the synthesized voice unit, the rhythm control unit is connected with the synthesized voice unit and transmits the pitch, the duration, the intensity, the pause, the intonation and other information of the target, and the synthesized voice unit receives the analysis result of the text analysis unit and the control parameters of the rhythm control unit and outputs the synthesized voice to the outside.
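As a concrete picture of that interface, here is a minimal sketch of the parameter set the prosody control unit could hand to the synthesized speech unit. The field names, types and units are assumptions for illustration; the patent only names pitch, duration, intensity, pause and intonation.

```python
# A sketch of the prosody-control interface: the prosody control unit fills
# these parameters and the synthesized speech unit combines them with the
# text analysis result. Field names and units are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ProsodyParams:
    pitch_hz: float        # target fundamental frequency
    duration_scale: float  # relative lengthening/shortening of phones
    intensity_db: float    # target loudness
    pause_ms: int          # pause length at phrase boundaries
    intonation: str        # e.g. "declarative" or "interrogative"

def synthesize(text_analysis_result: str, prosody: ProsodyParams) -> bytes:
    """Placeholder for the synthesized speech unit: turns the text analysis
    result plus the prosody control parameters into audio samples."""
    raise NotImplementedError  # actual waveform generation is out of scope
```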
The semantic recognition module comprises a domain labeling unit, an intention judging unit and an information extraction unit, wherein: the domain labeling unit is connected to the intention judging unit and transmits domain information; the intention judging unit is connected to the information extraction unit and transmits user intention information; and the information extraction unit outputs the semantic parsing result.
The exception handling module comprises an abnormality detection unit, a database query unit and an intervener unit, wherein: the abnormality detection unit receives the outputs of the speech recognition module and the semantic recognition module and decides whether intervention measures are taken; the database query unit receives the intervention signal of the abnormality detection unit and the semantic information of the semantic recognition module and queries and outputs intervention messages; and the intervener unit lets a human intervener perform the necessary selection and modification of the intervention messages output by the database query unit and finally outputs the reply message to the user.
The invention relates to a man-machine conversation exception handling method of the system, which specifically comprises the following steps:
step 1, providing a voice recognition module, a voice synthesis module, a semantic recognition module and an exception handling module.
Step 2, the speech recognition module converts the speech information into text information and outputs it to the semantic recognition module. The specific steps comprise:
2.1 front-end processing of the audio stream extracts features from the input signal for acoustic model processing, while reducing as much as possible the influence of environmental noise, channel, speaker and other factors on the features.
2.2 the decoder searches, based on the acoustic model, the language model and the dictionary, for the word string that outputs the input signal with maximum probability, and takes it as the speech recognition result.
Step 3, the semantic recognition module extracts the user's purpose and the corresponding key information from the text information. The specific steps comprise:
3.1 the domain to which the current conversation belongs is marked using the symbolic keywords in the text information.
3.2 within that specific domain, the user's intention is judged based on rules.
3.3 according to the domain and the user intention, and in combination with rules such as a preset template, the specific key information is extracted.
Step 4, the exception handling module judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine conversation is abnormal, and performs exception handling and message reply. The specific steps comprise:
4.1 the abnormality detection unit judges whether the current man-machine conversation is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module. If no abnormality is detected, the local client continues to handle the conversation; if an abnormality occurs, the intervention server takes over the man-machine conversation.
4.2 the database query unit queries the database according to the semantic information and obtains recommended intervention messages; if a message's recommendation degree is high, that message is used to intervene directly, and if it is low, an intervener is asked to intervene manually.
4.3 when the machine algorithm cannot find an intervention message with a high recommendation degree, the intervener intervenes to select and modify an intervention message, which is then sent to the client.
During man-machine conversation exception handling, after the user's speech input has passed through the machine's speech recognition and semantic parsing, the recognition result and the parsing result are transmitted in text form to the intervener, who, after receiving them, can choose to send a dialogue message or a command message. Dialogue messages are transmitted to the machine as text, after which the speech synthesis (TTS) system synthesizes speech and plays it to the user; command messages are commands executed directly by the machine.
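The two message kinds thus take different paths on the machine side. Below is a minimal dispatch sketch; the JSON envelope and the field names (`type`, `text`, `command`, `params`) are illustrative assumptions, not part of the patent.

```python
# Dialogue messages go to TTS; command messages are executed directly.
# The message schema is an assumption for illustration.
import json

def handle_intervention_message(raw_text, tts, executor):
    msg = json.loads(raw_text)
    if msg["type"] == "dialogue":
        tts.speak(msg["text"])                  # synthesized and played to the user
    elif msg["type"] == "command":
        executor.run(msg["command"], msg.get("params", {}))  # e.g. navigate to a POI
```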
As shown in fig. 3 and fig. 4, the embodiment comprises three stages, introduced in turn below: user input -> intervention message generation -> client pushes the intervention message.
1) user input
While the user speaks, the speech recognition system converts the user's audio into text, semantic parsing is performed on that text (the parsing result includes the user's current dialogue domain, the key information of the service request, and so on), and finally the text and the parsing result are transmitted in text form to the exception handling module through the POST method of the HTTP protocol.
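A sketch of this hand-off follows; the endpoint URL and the payload field names are assumptions, since the patent only specifies text over HTTP POST.

```python
# Stage 1: POST the recognition text and the semantic parse to the
# exception handling module. URL and field names are assumptions.
import requests

payload = {
    "user_id": "A",
    "asr_text": "I want to go to a playful place",
    "semantics": {"domain": "navigation", "intent": "navigate",
                  "slots": {"destination_tag": "playful"}},
}
resp = requests.post("http://exception-handler.example/api/requests",
                     json=payload, timeout=5)
resp.raise_for_status()   # the module now runs anomaly detection on this input
```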
2) Intervention message generation
When an abnormality occurs, the exception handling module queries the database using the text of the speech recognition and the semantic slots of the semantic recognition to obtain candidate intervention messages. If a candidate's recommendation degree is high, it is used for intervention directly; if it is low, an intervener is asked to intervene manually. On the interface the intervener sees auxiliary data provided by the exception handling module, such as the recognition result of the user's input and the semantic parsing result, and with this information can screen and modify the candidate intervention messages more accurately and quickly. Intervention messages are divided into dialogue messages and command messages; both are transmitted as text over a uniform WebSocket protocol and differ only in their content and in how the machine processes them.
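A sketch of this transport, using the `websockets` library; the URI and the message schema are assumptions, since the patent only specifies text messages over a uniform WebSocket protocol:

```python
# Stage 2: both message kinds travel as text frames over one WebSocket
# connection. URI and schema are assumptions for illustration.
import asyncio
import json
import websockets

async def send_intervention(uri, message):
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps(message))   # small text frame, low network load
        return await ws.recv()               # client's receipt confirmation

ack = asyncio.run(send_intervention(
    "ws://client.example/intervene",
    {"type": "dialogue", "text": "What kind of entertainment would you like?"}))
```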
3) Client push intervention messages
Immediately upon receiving an intervention message, the client returns a receipt confirmation to the intervener and caches the message in a message queue. The client monitors the current man-machine conversation state and tries to take a message from the queue and push it to the user when conditions are met. The push opportunities are: (1) when an intervention message arrives, and (2) when the broadcast of a TTS-synthesized voice message finishes. The conditions that must hold are: (1) the message queue is not empty, and (2) the client's audio player is currently idle. If the push succeeds, a "message pushed" confirmation is returned to the intervener.
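Putting the queue, the two push opportunities and the two push conditions together, a minimal client-side sketch (the `player`, `tts` and `server` collaborator interfaces are assumptions):

```python
# Stage 3: queue on arrival, confirm receipt at once, and push only when the
# queue is non-empty and the audio player is idle.
from collections import deque

class InterventionClient:
    def __init__(self, player, tts, server):
        self.queue = deque()
        self.player = player    # exposes is_idle() -> bool
        self.tts = tts          # exposes speak(text)
        self.server = server    # exposes ack(kind, msg)

    def on_message(self, msg):
        """Called when an intervention message arrives."""
        self.queue.append(msg)
        self.server.ack("received", msg)   # immediate receipt confirmation
        self.try_push()                    # push opportunity 1: message arrival

    def on_tts_finished(self):
        """Called when a TTS broadcast finishes."""
        self.try_push()                    # push opportunity 2: player freed

    def try_push(self):
        # Push conditions: queue not empty AND audio player currently idle.
        if self.queue and self.player.is_idle():
            msg = self.queue.popleft()
            self.tts.speak(msg["text"])
            self.server.ack("pushed", msg)  # confirmation after successful push
```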
For example:
1. User A issues a voice instruction: "I want to go to a playful place."
2. The speech recognition module converts the voice input into text.
3. The semantic parsing module processes the text and obtains the user intention "navigation" with the label "playful" for the navigation destination.
4. The abnormality detection unit in the exception handling module receives user A's service request, comprising the complete speech recognition result "I want to go to a playful place" and the semantic parsing results "navigation" and "playful", and detects that the current conversation state is abnormal.
5. The database query unit in the exception handling module queries the database according to "navigation" and "playful" and obtains candidate messages such as "Would you like to go to the playful snack street in Suzhou?" and "Found 5 playful places for you." Both messages have a low recommendation degree, so manual intervention by the intervener unit is required. Using the database query results, the semantic parsing result and the speech recognition text provided by the exception handling module, the intervener selects and modifies the intervention message, changing it to "What kind of entertainment would you like?", and this text message is sent to the user.
6. After receiving the intervention message, the client stores the intervention message in a message queue, sends feedback that the message is received to the exception handling module, and tries to push the message.
7. Once the conditions are met, the speech synthesis system synthesizes and broadcasts the intervention message; the user hears the audio "What kind of entertainment would you like?", and the client sends a "message pushed" feedback to the exception handling module.
8. The user makes a further speech input: "I want to sing."
9. The ASR system converts the speech input into text.
10. Semantic parsing obtains the user intention "navigation" with the navigation target "KTV".
11. The abnormality detection unit obtains user A's specific service requirement, comprising the complete speech recognition result "I want to sing" and the semantic parsing results "navigation" and "KTV".
12. The database query unit searches the database according to "navigation", "KTV" and the user's related information and obtains a candidate intervention message "xxx is recommended; would you like to go?" with a high recommendation degree; it therefore bypasses the intervener unit and sends that text message directly.
13. The user confirms that he wants to go.
14. The exception handling system pushes a command-type intervention message containing the command type "navigate" and the POI information of the destination.
15. The client takes the command-type "navigate" message and the corresponding POI information out of the message queue and performs the navigation operation; the client sends a "message pushed" feedback to the exception handling module, and the interaction ends.
The preferred embodiments of the present invention have been described in detail above. It should be understood that those skilled in the art can make numerous modifications and variations according to the concept of the invention without creative effort. Therefore, any technical solution that those skilled in the art can obtain, based on the prior art, through logical analysis, reasoning or limited experiments in accordance with the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (8)

1. An audio-based man-machine hybrid interaction system, characterized by comprising a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module, wherein the speech recognition module is connected to the semantic recognition module and transmits the text information corresponding to the speech; the exception handling module is connected to the speech recognition module and the semantic recognition module, the speech recognition module being configured to transmit the text information to the exception handling module and the semantic recognition module being configured to transmit the semantic parsing result to the exception handling module; the exception handling module is configured to be connected to the speech synthesis module and transmit intervention information; and the speech synthesis module is configured to convert the intervention information transmitted by the exception handling module into speech, play it to the user, and await the user's further feedback;
the abnormality processing module comprises an abnormality detection unit, a database query unit and an intervener unit, wherein the abnormality detection unit is configured to receive the outputs of the voice recognition module and the semantic recognition module and decide whether to take intervention measures, the database query unit is configured to receive an intervention signal of the abnormality detection unit, receive semantic information of the semantic recognition module, query and output the intervention information with high recommendation degree to the voice synthesis module; the intervener unit is configured to perform necessary preference and modification on the intervention information with low recommendation degree output by the database query unit by using an intervener, and then transmit the intervention information to the voice synthesis module to obtain a reply message to be further fed back by the user.
2. The audio-based human-computer hybrid interaction system according to claim 1, wherein the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model, and a decoder, wherein the signal processing and feature extraction unit is configured to be connected to the acoustic model and to transmit acoustic feature information, and the decoder is configured to be connected to the acoustic model and the language model and to output a recognition result.
3. The audio-based human-computer hybrid interactive system according to claim 1, wherein the speech synthesis module comprises a text analysis unit, a prosody control unit and a synthesized speech unit, wherein the text analysis unit is configured to receive and process text information, transmit the processing result to the prosody control unit and the synthesized speech unit, the prosody control unit is configured to be connected to the synthesized speech unit and transmit pitch, duration, intensity, pause and intonation information, and the synthesized speech unit is configured to receive the analysis result of the text analysis unit and the control parameters of the prosody control unit to synthesize the output speech.
4. The audio-based human-computer hybrid interaction system according to claim 1, wherein the semantic recognition module comprises a domain labeling unit, an intention judging unit, and an information extracting unit, wherein the domain labeling unit is configured to be connected with the intention judging unit and transmit domain information, the intention judging unit is configured to be connected with the information extracting unit and transmit user intention information, and the information extracting unit outputs a result of semantic analysis.
5. A man-machine mixed interaction method based on audio is characterized by comprising the following steps:
step 1, providing a voice recognition module, a voice synthesis module, a semantic recognition module and an exception handling module;
step 2, the voice recognition module converts voice information into character information and outputs the character information to the semantic recognition module;
step 3, the semantic recognition module extracts the user purpose and corresponding key information from the character information;
step 4, the exception handling module judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine conversation is abnormal, and replies with exception handling information;
wherein, in the step 4, the method specifically comprises the following steps:
step 4.1, the abnormality detection unit judges whether the current man-machine conversation is abnormal according to the text information of the speech recognition module and the semantic information of the semantic recognition module; if it is abnormal, the intervener unit takes over the man-machine conversation;
step 4.2, the database query unit queries the database according to the semantic information and obtains intervention information with recommendation degrees; if the recommendation degree of the intervention information is high, the intervention information is used to intervene directly and is sent to the client, and the method returns to step 2 to await the user's further feedback; if the recommendation degree is low, an intervener is requested to intervene manually;
step 4.3, when the machine algorithm cannot find intervention information with a high recommendation degree, the intervener intervenes to select and modify the intervention information, the modified intervention information is then sent to the client, and the method returns to step 2 to await the user's further feedback.
6. The audio-based human-computer hybrid interaction method according to claim 5, wherein in the step 2, the method specifically comprises the following steps:
step 2.1, extracting features from the input audio stream for processing by the acoustic model, while reducing the influence of environmental noise, channel and speaker factors on the features;
step 2.2, the decoder searches, according to the acoustic model, the language model and the dictionary, for the word string that outputs the audio stream with maximum probability, and takes it as the speech recognition result.
7. The audio-based human-computer hybrid interaction method according to claim 5, wherein in step 3, the method specifically comprises the following steps:
step 3.1, marking the domain to which the current conversation belongs by using the symbolic keywords in the text information;
step 3.2, judging the user intention within that domain based on rules;
step 3.3, extracting the specific key information according to the domain and the user intention, in combination with rules.
8. The audio-based human-computer hybrid interaction method according to claim 5 or 7, wherein the key information comprises a dialogue domain and dialogue keywords, the dialogue keywords comprising content keywords and emotion keywords.
CN201610791966.0A 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio Active CN106409283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610791966.0A CN106409283B (en) 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610791966.0A CN106409283B (en) 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio

Publications (2)

Publication Number Publication Date
CN106409283A CN106409283A (en) 2017-02-15
CN106409283B (en) 2020-01-10

Family

ID=58001464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610791966.0A Active CN106409283B (en) 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio

Country Status (1)

Country Link
CN (1) CN106409283B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204185B (en) * 2017-05-03 2021-05-25 深圳车盒子科技有限公司 Vehicle-mounted voice interaction method and system and computer readable storage medium
CN107122807B (en) * 2017-05-24 2021-05-21 努比亚技术有限公司 Home monitoring method, server and computer readable storage medium
CN107733780B (en) * 2017-09-18 2020-07-03 上海量明科技发展有限公司 Intelligent task allocation method and device and instant messaging tool
CN109697226A (en) * 2017-10-24 2019-04-30 上海易谷网络科技股份有限公司 Text silence seat monitoring robot interactive method
CN107992587A (en) * 2017-12-08 2018-05-04 北京百度网讯科技有限公司 A kind of voice interactive method of browser, device, terminal and storage medium
CN110069607B (en) * 2017-12-14 2024-03-05 株式会社日立制作所 Method, apparatus, electronic device, and computer-readable storage medium for customer service
US10983526B2 (en) 2018-09-17 2021-04-20 Huawei Technologies Co., Ltd. Method and system for generating a semantic point cloud map
CN110970017B (en) * 2018-09-27 2023-06-23 北京京东尚科信息技术有限公司 Man-machine interaction method and system and computer system
CN111125384B (en) * 2018-11-01 2023-04-07 阿里巴巴集团控股有限公司 Multimedia answer generation method and device, terminal equipment and storage medium
CN110602334A (en) * 2019-09-03 2019-12-20 上海航动科技有限公司 Intelligent outbound method and system based on man-machine cooperation
CN110926493A (en) * 2019-12-10 2020-03-27 广州小鹏汽车科技有限公司 Navigation method, navigation device, vehicle and computer readable storage medium
CN111540353B (en) * 2020-04-16 2022-11-15 重庆农村商业银行股份有限公司 Semantic understanding method, device, equipment and storage medium
CN112509575B (en) * 2020-11-26 2022-07-22 上海济邦投资咨询有限公司 Financial consultation intelligent guiding system based on big data
CN112735427B (en) * 2020-12-25 2023-12-05 海菲曼(天津)科技有限公司 Radio reception control method and device, electronic equipment and storage medium
CN116453540B (en) * 2023-06-15 2023-08-29 山东贝宁电子科技开发有限公司 Underwater frogman voice communication quality enhancement processing method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920948A (en) * 2005-08-24 2007-02-28 富士通株式会社 Voice recognition system and voice processing system
CN101276584A (en) * 2007-03-28 2008-10-01 株式会社东芝 Prosody-pattern generating apparatus, speech synthesizing apparatus, and computer program product and method thereof
CN102509483A (en) * 2011-10-31 2012-06-20 苏州思必驰信息科技有限公司 Distributive automatic grading system for spoken language test and method thereof
CN102982799A (en) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimization decoding method integrating guide probability
CN104678868A (en) * 2015-01-23 2015-06-03 贾新勇 Business and equipment operation and maintenance monitoring system
CN105227790A (en) * 2015-09-24 2016-01-06 北京车音网科技有限公司 A kind of voice answer method, electronic equipment and system
CN105723362A (en) * 2013-10-28 2016-06-29 余自立 Natural expression processing method, processing and response method, device, and system

Also Published As

Publication number Publication date
CN106409283A (en) 2017-02-15

Similar Documents

Publication Publication Date Title
CN106409283B (en) Man-machine mixed interaction system and method based on audio
KR102108500B1 (en) Supporting Method And System For communication Service, and Electronic Device supporting the same
US20060235694A1 (en) Integrating conversational speech into Web browsers
CN111128126A (en) Multi-language intelligent voice conversation method and system
CN101576901B (en) Method for generating search request and mobile communication equipment
CN102196207A (en) Method, device and system for controlling television by using voice
CN101207655A (en) Method and system switching between voice and text exchanging forms in a communication conversation
US11404052B2 (en) Service data processing method and apparatus and related device
CN102724309A (en) Vehicular voice network music system and control method thereof
KR20180091707A (en) Modulation of Packetized Audio Signal
CN105117391A (en) Translating languages
JP2018510407A (en) Q & A information processing method, apparatus, storage medium and apparatus
CN111833875B (en) Embedded voice interaction system
KR20140112360A (en) Vocabulary integration system and method of vocabulary integration in speech recognition
CN108882101B (en) Playing control method, device, equipment and storage medium of intelligent sound box
CN105206272A (en) Voice transmission control method and system
CN106991106A (en) Reduce as the delay caused by switching input mode
CN110119514A (en) The instant translation method of information, device and system
CN102847325A (en) Toy control method and system based on voice interaction of mobile communication terminal
CN112866086A (en) Information pushing method, device, equipment and storage medium for intelligent outbound
CN111094924A (en) Data processing apparatus and method for performing voice-based human-machine interaction
CN116431316B (en) Task processing method, system, platform and automatic question-answering method
JP2022101663A (en) Human-computer interaction method, device, electronic apparatus, storage media and computer program
CN111554280A (en) Real-time interpretation service system for mixing interpretation contents using artificial intelligence and interpretation contents of interpretation experts
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200619

Address after: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Address before: 200240 Dongchuan Road, Shanghai, No. 800, No.

Patentee before: SHANGHAI JIAO TONG University

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20201105

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: AI SPEECH Ltd.

Address before: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee before: AI SPEECH Ltd.

CP01 Change in the name or title of a patent holder