CN106409283A - Audio frequency-based man-machine mixed interaction system and method - Google Patents

Audio frequency-based man-machine mixed interaction system and method

Info

Publication number
CN106409283A
CN106409283A (application CN201610791966.0A)
Authority
CN
China
Prior art keywords
message
unit
module
intervention
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610791966.0A
Other languages
Chinese (zh)
Other versions
CN106409283B (en)
Inventor
俞凯
石开宇
郑达
陈露
常成
曹迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201610791966.0A priority Critical patent/CN106409283B/en
Publication of CN106409283A publication Critical patent/CN106409283A/en
Application granted granted Critical
Publication of CN106409283B publication Critical patent/CN106409283B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — Physics
    • G10 — Musical instruments; acoustics
    • G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L13/00 — Speech synthesis; text-to-speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; stress or intonation
    • G — Physics
    • G06 — Computing; calculating or counting
    • G06F — Electric digital data processing
    • G06F40/00 — Handling natural language data
    • G06F40/30 — Semantic analysis
    • G — Physics
    • G10 — Musical instruments; acoustics
    • G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L15/00 — Speech recognition
    • G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses an audio-based human-machine hybrid interaction system in which a speech recognition module and a semantic recognition module are interconnected and transmit the text corresponding to the speech; an exception handling module is connected to both the speech recognition module, which transmits the text to it, and the semantic recognition module, which transmits semantic parsing results to it; and the exception handling module and a speech synthesis module are interconnected and transmit intervention information. The invention also discloses an audio-based human-machine hybrid interaction method: the speech recognition module converts speech information into text and outputs it to the semantic recognition module; the semantic recognition module extracts the user's goal and the corresponding key information from the text; and the exception handling module judges, from the text of the speech recognition module and the semantic information of the semantic recognition module, whether an anomaly has occurred in the current human-machine dialogue, and gives a message reply for the anomaly. The system and method disclosed in the technical solution of the invention can provide a unified human-machine dialogue experience.

Description

Audio-based human-machine hybrid interaction system and method
Technical field
The present invention relates to the field of information processing, and in particular to an audio-based human-machine hybrid interaction system and method.
Background technology
As shown in Fig. 1, current audio-based interaction systems present the machine's reply to the user as the final reply. When the machine decision system cannot determine the user's intent, most dialogue systems fall back on prompts such as "Pardon?" to ask the user to repeat the input; some interaction systems introduce manual intervention based on a call center.
Existing human-machine dialogue exception handling is implemented mainly in call-center form. When the machine cannot process the user's audio input, or the user explicitly indicates that manual service is needed, the call center intervenes: a one-to-one call is established between the user and an operator, who talks with the user directly, learns the user's needs, and issues the corresponding instructions through the operator console.
The manual intervention mode of a live call center has the following main problems. Low labor efficiency: the intervener must hold a one-to-one voice conversation with the user and cannot serve anyone else while waiting for the user's input. High cost: a large-scale call center requires a series of telecom equipment and integrated services, and the low efficiency requires more interveners, indirectly raising labor costs. Strong dependence on the network environment: transmitting audio directly requires a stable network connection; fluctuations in the network degrade audio quality and thus the dialogue experience, and may even interrupt the human-machine dialogue.
Therefore, those skilled in the art are motivated to develop an audio-based human-machine hybrid interaction system and method that combines manual-intervention replies with machine replies, thereby unifying the interaction flow and improving the user experience.
Content of the invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is how to improve interaction efficiency and user experience during customer service.
To achieve the above object, the present invention provides an audio-based human-machine hybrid interaction system comprising a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module. The speech recognition module is configured to connect to the semantic recognition module and transmit the text corresponding to the speech; the exception handling module is configured to connect to the speech recognition module and the semantic recognition module; the speech recognition module is configured to transmit the text to the exception handling module, and the semantic recognition module is configured to transmit semantic parsing results to the exception handling module; and the exception handling module is configured to connect to the speech synthesis module and transmit intervention information.
Further, the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model and a decoder. The signal processing and feature extraction unit is configured to connect to the acoustic model and transmit acoustic feature information, and the decoder is configured to connect to the acoustic model and the language model and output the recognition result.
Further, the speech synthesis module comprises a text analysis unit, a prosody control unit and a speech synthesis unit. The text analysis unit is configured to receive text, process it, and transmit the result to the prosody control unit and the speech synthesis unit; the prosody control unit is configured to connect to the speech synthesis unit and transmit pitch, duration, intensity, pause and intonation information; and the speech synthesis unit is configured to synthesize and output speech from the analysis result of the text analysis unit and the control parameters of the prosody control unit.
Further, the semantic recognition module comprises a domain tagging unit, an intent determination unit and an information extraction unit. The domain tagging unit is configured to connect to the intent determination unit and transmit domain information; the intent determination unit is configured to connect to the information extraction unit and transmit user intent information; and the information extraction unit outputs the semantic parsing result.
Further, the exception handling module comprises an anomaly detection unit, a database query unit and an intervener unit. The anomaly detection unit is configured to receive the output of the speech recognition module and the semantic recognition module and decide whether to take intervention measures; the database query unit is configured to receive the intervention signal from the anomaly detection unit and the semantic information from the semantic recognition module, then query for and output candidate intervention messages; and the intervener unit is configured to let a human intervener screen and modify the intervention messages output by the database query unit as needed, finally outputting the reply message to the user.
The present invention also provides an audio-based human-machine hybrid interaction method comprising the following steps:
Step 1: provide a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module;
Step 2: the speech recognition module converts the speech information into text and outputs it to the semantic recognition module;
Step 3: the semantic recognition module extracts the user's goal and the corresponding key information from the text;
Step 4: the exception handling module judges, from the text of the speech recognition module and the semantic information of the semantic recognition module, whether an anomaly has occurred in the current human-machine dialogue, and gives a message reply for the anomaly.
Further, step 2 specifically comprises the following steps:
Step 2.1: extract features from the input audio stream for the acoustic model to process, while reducing the impact of environmental noise, the channel and speaker-dependent factors on the features;
Step 2.2: based on the acoustic model, the language model and the dictionary, the decoder searches, over the output of the acoustic model, for the word string that can produce the audio stream with maximum probability, and takes it as the speech recognition result.
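The maximum-probability search in step 2.2 can be illustrated with a toy sketch: score each candidate word string W by the product of its acoustic likelihood P(X|W) and its language-model probability P(W), and return the argmax. This is only an illustration under assumed probabilities — real decoders search lattices of hypotheses, and all numbers and candidate strings below are invented.

```python
# Toy decoder: pick the word string maximizing P(X | W) * P(W).
# Probabilities are invented for illustration only.

def decode(acoustic_scores, lm_scores):
    """Return the candidate word string with maximum combined probability."""
    candidates = acoustic_scores.keys() & lm_scores.keys()
    return max(candidates, key=lambda w: acoustic_scores[w] * lm_scores[w])

# P(X | W): how well each candidate string explains the input audio features.
acoustic = {"I want to go singing": 0.60, "I want to go sing in": 0.35}
# P(W): prior plausibility of each candidate under the language model.
lm = {"I want to go singing": 0.02, "I want to go sing in": 0.001}

best = decode(acoustic, lm)
print(best)  # I want to go singing
```

The language model here breaks the tie between acoustically similar candidates, which is exactly why step 2.2 consults both models rather than the acoustic scores alone.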
Further, step 3 specifically comprises the following steps:
Step 3.1: use the keywords in the text to tag the salient domain of the current dialogue;
Step 3.2: within that domain, judge the user's intent based on rules;
Step 3.3: extract the specific key information according to the domain and the user intent, in combination with rules.
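Steps 3.1–3.3 above can be sketched as a small rule-based pipeline. The keyword tables, rule functions and slot names below are hypothetical — the patent does not specify them — but the three-stage structure (domain tagging, rule-based intent, key-information extraction) follows the steps as described.

```python
# Minimal sketch of steps 3.1-3.3: keyword-based domain tagging,
# rule-based intent judgment, and key-information extraction.
# All keyword tables and rules are invented for illustration.

DOMAIN_KEYWORDS = {"navigation": ["go to", "navigate", "route"],
                   "music": ["play", "song"]}

INTENT_RULES = {"navigation": lambda text: "navigate",
                "music": lambda text: "play_music"}

def parse(text):
    text = text.lower()
    # Step 3.1: tag the salient domain of the current dialogue via keywords.
    domain = next((d for d, kws in DOMAIN_KEYWORDS.items()
                   if any(k in text for k in kws)), None)
    if domain is None:
        return None
    # Step 3.2: rule-based intent judgment within the tagged domain.
    intent = INTENT_RULES[domain](text)
    # Step 3.3: extract key information (here: the phrase after "go to").
    slots = {}
    if domain == "navigation" and "go to" in text:
        slots["destination"] = text.split("go to", 1)[1].strip()
    return {"domain": domain, "intent": intent, "slots": slots}

print(parse("I want to go to a KTV"))
# {'domain': 'navigation', 'intent': 'navigate', 'slots': {'destination': 'a ktv'}}
```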
Further, step 4 specifically comprises the following steps:
Step 4.1: the anomaly detection unit judges, from the text of the speech recognition module and the semantic information of the semantic recognition module, whether an anomaly has occurred in the current human-machine dialogue; if so, the intervener unit takes over the human-machine dialogue;
Step 4.2: the database query unit queries the database according to the semantic information and obtains intervention messages with recommendation scores; if an intervention message has a high recommendation score, it is used directly for the intervention; if the scores are low, the intervener performs manual intervention;
Step 4.3: when the machine algorithm cannot find an intervention message with a high recommendation score, the intervener steps in to select and modify the intervention message, and the modified intervention message is then sent to the client.
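The routing logic of steps 4.2–4.3 reduces to a threshold test on the recommendation score. The sketch below assumes a numeric score and a fixed cutoff; the patent specifies neither, only "high" versus "low" recommendation, so both the threshold value and the data shapes are illustrative assumptions.

```python
# Sketch of steps 4.2-4.3: use the best database candidate automatically if
# its recommendation score is high enough; otherwise hand the candidates to
# the human intervener for selection/editing. Threshold value is assumed.

RECOMMEND_THRESHOLD = 0.8  # assumed cutoff; not specified in the patent

def choose_reply(candidates, human_edit):
    """candidates: list of (message, score); human_edit: intervener fallback."""
    if candidates:
        best_msg, best_score = max(candidates, key=lambda c: c[1])
        if best_score >= RECOMMEND_THRESHOLD:
            return best_msg          # step 4.2: use the candidate directly
    # step 4.3: no high-scoring candidate -> the intervener selects and edits
    return human_edit(candidates)

auto = choose_reply([("I recommend KTV X. Shall we go?", 0.95)],
                    human_edit=lambda c: "edited by intervener")
manual = choose_reply([("fun snack street?", 0.3), ("5 fun places", 0.4)],
                      human_edit=lambda c: "What kind of entertainment would you like?")
print(auto)    # I recommend KTV X. Shall we go?
print(manual)  # What kind of entertainment would you like?
```

This is the mechanism that lets one intervener serve several users at once: only the low-scoring cases reach a human.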
Further, described key message includes dialogue field, dialogue key word, and described dialogue key word includes content and closes Keyword and emotion key word.
Compared with the prior art, the technical effects of the present invention include:
1. Higher efficiency: the time an intervener spends waiting for user input is fully utilized, so one intervener can serve several users at the same time, improving intervention efficiency.
2. Lower cost: no call-center telecom equipment needs to be purchased; an intervention service platform can be built with existing computers and servers.
3. Richer working scenarios: because the intervener interface uses a B/S (Browser/Server) architecture, an intervener can log in to the corresponding website in a browser and perform intervention operations without answering calls at a workstation, and can provide intervention service from mobile terminals such as tablets, smartphones and laptops.
4. Low network requirements: the amount of data transmitted is very small, so the demands on the network are reduced; at the same time, the speech the user hears is synthesized locally and is unaffected by network conditions.
5. A unified human-machine dialogue experience: the intervener is transparent to the user, who experiences the dialogue as talking to a sufficiently intelligent "machine" that connects seamlessly with the current human-machine dialogue mode.
The concept, concrete structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that the purpose, features and effects of the present invention can be fully understood.
Brief description
Fig. 1 is a schematic diagram of the intervention mode of an existing conventional call center;
Fig. 2 is a schematic diagram of the system modules of the present invention;
Fig. 3 is a schematic flow diagram of the system of a preferred embodiment of the present invention;
Fig. 4 is a schematic flow diagram of part of a dialogue of a preferred embodiment of the present invention.
Specific embodiment
The present invention is realized through the following technical solution:
As shown in Fig. 2, the present invention relates to an audio-based human-machine dialogue exception handling system comprising a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module. The speech recognition module is connected to the semantic recognition module and transmits the text corresponding to the speech; both the speech recognition module and the semantic recognition module are connected to the exception handling module and transmit, respectively, the text and the semantic parsing result; and the exception handling module is connected to the speech synthesis module and transmits the intervention information.
The speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model and a decoder. The signal processing and feature extraction unit is connected to the acoustic model and transmits acoustic feature information; the decoder is connected to the acoustic model and the language model, and outputs the recognition result.
The speech synthesis module comprises a text analysis unit, a prosody control unit and a speech synthesis unit. The text analysis unit receives text, processes it, and transmits the result to the prosody control unit and the speech synthesis unit; the prosody control unit is connected to the speech synthesis unit and transmits information such as the target pitch, duration, intensity, pauses and intonation; and the speech synthesis unit receives the analysis result of the text analysis unit and the control parameters of the prosody control unit, and outputs the synthesized speech.
The semantic recognition module comprises a domain tagging unit, an intent determination unit and an information extraction unit. The domain tagging unit is connected to the intent determination unit and transmits domain information; the intent determination unit is connected to the information extraction unit and transmits user intent information; and the information extraction unit outputs the semantic parsing result.
The exception handling module comprises an anomaly detection unit, a database query unit and an intervener unit. The anomaly detection unit receives the output of the speech recognition module and the semantic recognition module and decides whether to take intervention measures; the database query unit receives the intervention signal from the anomaly detection unit and the semantic information from the semantic recognition module, then queries for and outputs intervention messages; and the intervener unit lets a human intervener screen and modify the intervention messages output by the database query unit as needed, finally outputting the reply message to the user.
The present invention also relates to a human-machine dialogue exception handling method using the above system, which specifically comprises the following steps:
Step 1: provide a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module.
Step 2: the speech recognition module converts the speech information into text and outputs it to the semantic recognition module. The concrete steps are:
2.1 The front end processes the audio stream and extracts features from the input signal for the acoustic model to process, while reducing as far as possible the impact of factors such as environmental noise, the channel and the speaker on the features.
2.2 Based on the acoustic model, the language model and the dictionary, the decoder searches for the word string that can produce the input signal with maximum probability, and takes it as the speech recognition result.
Step 3: the semantic recognition module extracts the user's goal and the corresponding key information from the text. The concrete steps are:
3.1 Use the keywords in the text to tag the salient domain of the current dialogue.
3.2 Within the specific domain, judge the user's intent based on rules.
3.3 Extract the specific key information according to the domain and the user intent, in combination with rules such as preset templates.
Step 4: the exception handling module judges, from the text of the speech recognition module and the semantic information of the semantic recognition module, whether an anomaly has occurred in the current human-machine dialogue, and performs exception handling and message replies. The concrete steps are:
4.1 The anomaly detection unit judges, from the text of the speech recognition module and the semantic information of the semantic recognition module, whether an anomaly has occurred in the current human-machine dialogue. If there is no anomaly, the dialogue is handled by the local client; if there is an anomaly, the intervention server takes over the human-machine dialogue.
4.2 The database query unit queries the database according to the semantic information and obtains recommended intervention messages. If an intervention message has a high recommendation score, it is used directly for the intervention; if the scores are low, the intervener performs manual intervention.
4.3 When the machine algorithm cannot find an intervention message with a high recommendation score, the intervener steps in to select and modify the intervention message, and the modified intervention message is then sent to the client.
During human-machine dialogue exception handling, the user's speech input passes through the machine's speech recognition and semantic parsing, and the recognition result and the semantic parsing result are passed to the intervener's console in text form. After receiving the message, the intervener can choose to send a dialogue message or a command message. A dialogue message is transmitted to the machine as text, then synthesized into speech by the speech synthesis system (TTS) and played to the user; a command message is an order executed directly by the machine.
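The dialogue/command split described above can be sketched as a small dispatcher. The JSON envelope and field names are assumptions — the patent only states that both kinds travel as text and differ in content and in how the machine processes them.

```python
# Hypothetical dispatcher for the two intervention message kinds: a dialogue
# message is handed to TTS and played; a command message is executed directly.
# The JSON envelope ("type", "text", "command", "args") is an assumption.
import json

def handle_intervention(raw, speak, execute):
    """Dispatch a text-form intervention message to TTS or to the executor."""
    msg = json.loads(raw)
    if msg["type"] == "dialogue":
        speak(msg["text"])                    # synthesized and played to user
    elif msg["type"] == "command":
        execute(msg["command"], msg.get("args", {}))  # machine executes it
    else:
        raise ValueError(f"unknown message type: {msg['type']}")

spoken, executed = [], []
handle_intervention(
    json.dumps({"type": "dialogue",
                "text": "What kind of entertainment would you like?"}),
    speak=spoken.append,
    execute=lambda cmd, args: executed.append((cmd, args)))
handle_intervention(
    json.dumps({"type": "command", "command": "navigate",
                "args": {"poi": "KTV X"}}),
    speak=spoken.append,
    execute=lambda cmd, args: executed.append((cmd, args)))
print(spoken[0])    # What kind of entertainment would you like?
print(executed[0])  # ('navigate', {'poi': 'KTV X'})
```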
As shown in Fig. 3 and Fig. 4, this embodiment introduces the technical solution in three steps: user input --> intervention message generation --> the client pushes the intervention message.
1) User input
While the user performs speech input, the speech recognition system converts the user's input audio into text, and semantic parsing is performed on this text at the same time (the semantic parsing result includes the user's current dialogue domain, the key information of the service the user requests, and so on). Finally, the text and the semantic parsing result are transmitted in text form to the exception handling module via the POST method of the HTTP protocol.
2) Intervention message generation
In abnormal situations, the exception handling module queries the database according to the speech recognition text and the semantic slots from semantic recognition, and obtains candidate intervention messages. If a candidate intervention message has a high recommendation score, it is used directly for the intervention; if the scores are low, the intervener performs manual intervention. On the interface, the intervener can see auxiliary data provided by the exception handling module, such as the recognition result of the user's input and the semantic parsing result, and with this information can screen and modify the candidate intervention messages faster and more accurately. Intervention messages are divided into dialogue messages and command messages; both are transmitted in text form over a unified WebSocket protocol, and differ only in their content and in how the machine processes them.
3) The client pushes the intervention message
After receiving an intervention message, the client immediately returns a "message received" confirmation to the intervener and buffers the message in a message queue. The client monitors the current human-machine dialogue state and, under certain conditions, tries to take a message out of the queue and push it to the user. The push moments are: (1) when an intervention message arrives; (2) when playback of a TTS-synthesized speech message completes. The conditions that must be met are: (1) the message queue is not empty; (2) the client's audio player is currently idle. If the intervention message is pushed successfully, a "message pushed" confirmation is returned to the intervener.
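The buffering and push conditions of step 3) can be sketched as a small state machine: push only when the queue is non-empty and the player is idle, and attempt a push at the two stated moments (message arrival, TTS playback end). Class and method names are illustrative, not from the patent.

```python
# Sketch of the client's push logic: buffer intervention messages in a queue,
# push only when (queue non-empty AND audio player idle), attempt a push on
# message arrival and on TTS playback completion. Names are illustrative.
from collections import deque

class Client:
    def __init__(self):
        self.queue = deque()
        self.player_idle = True
        self.pushed = []

    def on_message(self, msg):
        self.ack("message received")  # confirm receipt to the intervener
        self.queue.append(msg)        # buffer first, then try to push
        self.try_push()               # push moment 1: message arrival

    def on_tts_finished(self):
        self.player_idle = True
        self.try_push()               # push moment 2: playback completed

    def try_push(self):
        # Conditions: queue not empty AND audio player currently idle.
        if self.queue and self.player_idle:
            msg = self.queue.popleft()
            self.player_idle = False  # playback of the pushed message starts
            self.pushed.append(msg)
            self.ack("message pushed")

    def ack(self, text):
        pass  # would notify the intervener over the network

c = Client()
c.on_message("What kind of entertainment would you like?")
c.on_message("I recommend KTV X. Shall we go?")  # player busy: stays queued
print(c.pushed)       # ['What kind of entertainment would you like?']
c.on_tts_finished()
print(len(c.pushed))  # 2
```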
For example:
1. User A gives the voice command "I want to go somewhere fun."
2. The speech recognition module converts the speech input into text.
3. After processing, the semantic analysis module obtains the user intent "navigate", with the label of the navigation target being "fun".
4. The anomaly detection unit in the exception handling module receives user A's service request, containing the complete speech recognition result "I want to go somewhere fun" and the semantic parsing results "navigate" and "fun", and at the same time detects that the current dialogue state is abnormal.
5. The database query unit in the exception handling module queries the database with "navigate" and "fun" and obtains some candidate messages, such as "Would you like to go to the fun snack street in Suzhou?" and "I found 5 places related to fun for you." Both messages have low recommendation scores, so the intervener unit requires manual intervention. Using the database query result, the semantic parsing result and the speech recognition text provided by the exception handling module, the intervener selects and modifies the intervention message, changes it to "What kind of entertainment would you like?", and sends this text message to the user.
6. After receiving the intervention message, the client puts it into the message queue, sends "message received" feedback to the exception handling module, and attempts to push it.
7. Once the conditions are met, the intervention message is synthesized and played by the speech synthesis system; the user hears the audio "What kind of entertainment would you like?", and the client sends "message pushed" feedback to the exception handling module.
8. The user makes a further speech input: "I want to go singing."
9. The ASR system converts the speech input into text.
10. Semantic parsing obtains the user intent "navigate", with the navigation target "KTV".
11. The anomaly detection unit obtains user A's specific service demand, containing the complete speech recognition result "I want to go singing" and the semantic parsing results "navigate" and "KTV".
12. The database query unit searches the database with "navigate", "KTV" and the user's related information, and obtains the candidate intervention message "I recommend xxx. Would you like to go there?" Because its recommendation score is very high, the intervener unit is bypassed and the text message "I recommend xxx. Would you like to go there?" is sent directly to the client.
13. The user confirms.
14. The exception handling system pushes a command-type intervention message to the user, containing the command type "navigate" and the POI of the destination.
15. The client takes the command-type message "navigate" and the corresponding POI out of the message queue and performs the navigation operation; the client sends "message pushed" feedback to the exception handling module, and the interaction ends.
The preferred embodiments of the present invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative work. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation under the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. An audio-based human-machine hybrid interaction system, characterized by comprising a speech recognition module, a speech synthesis module, a semantic recognition module and an exception handling module, wherein the speech recognition module is configured to connect to the semantic recognition module and transmit the text corresponding to the speech; the exception handling module is configured to connect to the speech recognition module and the semantic recognition module; the speech recognition module is configured to transmit the text to the exception handling module, and the semantic recognition module is configured to transmit semantic parsing results to the exception handling module; and the exception handling module is configured to connect to the speech synthesis module and transmit intervention information.
2. The audio-based human-machine hybrid interaction system of claim 1, characterized in that the speech recognition module comprises a signal processing and feature extraction unit, an acoustic model, a language model and a decoder, wherein the signal processing and feature extraction unit is configured to connect to the acoustic model and transmit acoustic feature information, and the decoder is configured to connect to the acoustic model and the language model and output the recognition result.
3. The audio-based human-machine hybrid interaction system of claim 1, characterized in that the speech synthesis module comprises a text analysis unit, a prosody control unit and a speech synthesis unit, wherein the text analysis unit is configured to receive text, process it, and transmit the result to the prosody control unit and the speech synthesis unit; the prosody control unit is configured to connect to the speech synthesis unit and transmit pitch, duration, intensity, pause and intonation information; and the speech synthesis unit is configured to synthesize and output speech from the analysis result of the text analysis unit and the control parameters of the prosody control unit.
4. The audio-based human-machine hybrid interaction system of claim 1, characterized in that the semantic recognition module comprises a domain tagging unit, an intent determination unit and an information extraction unit, wherein the domain tagging unit is configured to connect to the intent determination unit and transmit domain information; the intent determination unit is configured to connect to the information extraction unit and transmit user intent information; and the information extraction unit outputs the semantic parsing result.
5. The audio-based human-machine hybrid interaction system of claim 1, characterized in that the exception handling module comprises an anomaly detection unit, a database query unit and an intervener unit, wherein the anomaly detection unit is configured to receive the output of the speech recognition module and the semantic recognition module and decide whether to take intervention measures; the database query unit is configured to receive the intervention signal from the anomaly detection unit and the semantic information from the semantic recognition module, then query for and output intervention messages; and the intervener unit is configured to let a human intervener screen and modify the intervention messages output by the database query unit as needed, finally outputting the reply message to the user.
6. An audio-based man-machine mixed interaction method, characterised by comprising the following steps:
Step 1: providing a speech recognition module, a speech synthesis module, a semantic recognition module and an exception processing module;
Step 2: the speech recognition module converts the voice information into text information and outputs it to the semantic recognition module;
Step 3: the semantic recognition module extracts the user's purpose and the corresponding key information from the text information;
Step 4: the exception processing module judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine dialogue is abnormal, and replies with a processing message directed at the abnormality.
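Steps 1 through 4 can be sketched end to end as a single dialogue turn, assuming stand-in implementations for each module; the real modules are of course far more involved.

```python
def speech_recognition(audio):
    return audio["transcript"]            # step 2: speech -> text (stub)

def semantic_recognition(text):
    # step 3: recover the user's purpose (toy keyword rule)
    return {"intent": "greet" if "hello" in text else "unknown"}

def exception_processing(text, semantics):
    # step 4: anomalous turn -> intervention reply, otherwise no action
    if semantics["intent"] == "unknown":
        return "Let me connect you to an agent."
    return None

def dialogue_turn(audio):
    text = speech_recognition(audio)
    semantics = semantic_recognition(text)
    return exception_processing(text, semantics) or semantics["intent"]

outcome = dialogue_turn({"transcript": "hello system"})
```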
7. The audio-based man-machine mixed interaction method according to claim 6, wherein step 2 specifically comprises the following steps:
Step 2.1: extracting features from the input audio stream for processing by an acoustic model, while reducing the influence of environmental noise, channel and speaker factors on the features;
Step 2.2: a decoder searches, according to the acoustic model, the language model and the dictionary, for the word string that produces the audio stream with maximum probability, and outputs it as the speech recognition result.
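Step 2.2 is the standard maximum-probability decoding criterion: pick the word string W maximizing P(O|W)·P(W), the product of an acoustic-model score and a language-model score. A toy illustration over a fixed candidate list, with invented probabilities:

```python
import math

# invented acoustic-model scores P(O|W) and language-model scores P(W)
acoustic_score = {"recognize speech": 0.6, "wreck a nice beach": 0.4}
language_score = {"recognize speech": 0.05, "wreck a nice beach": 0.001}

def decode(candidates):
    # work in log space, as real decoders do, to avoid numeric underflow
    return max(candidates,
               key=lambda w: math.log(acoustic_score[w]) + math.log(language_score[w]))

best = decode(list(acoustic_score))
```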
8. The audio-based man-machine mixed interaction method according to claim 6, wherein step 3 specifically comprises the following steps:
Step 3.1: tagging, by means of keywords in the text information, the significant domain to which the current dialogue belongs;
Step 3.2: judging the user intent based on rules within that domain;
Step 3.3: extracting the specific key information according to the domain and the user intent, in combination with rules.
9. The audio-based man-machine mixed interaction method according to claim 6, wherein step 4 specifically comprises the following steps:
Step 4.1: the anomaly detection unit judges, from the text information of the speech recognition module and the semantic information of the semantic recognition module, whether the current man-machine dialogue is abnormal; if it is abnormal, the human-agent unit takes over the dialogue;
Step 4.2: the database query unit queries the database according to the semantic information to obtain an intervention message carrying a recommendation degree; if the recommendation degree of the intervention message is high, that message is used to intervene directly; if the recommendation degree is low, the human agent is requested to intervene manually;
Step 4.3: when the machine algorithm cannot find an intervention message with a high recommendation degree, the human agent selects and modifies the intervention message, and the modified intervention message is then sent to the client.
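The recommendation-degree branching of steps 4.2 and 4.3 amounts to a threshold decision. In this sketch the 0.7 cutoff is an assumption for illustration; the claims do not fix a numeric threshold.

```python
THRESHOLD = 0.7  # assumed cutoff, not specified in the claims

def choose_reply(candidates, agent_reply):
    """candidates: list of (message, recommendation_degree) pairs."""
    if candidates:
        message, degree = max(candidates, key=lambda pair: pair[1])
        if degree >= THRESHOLD:
            return message          # high recommendation: send directly
    return agent_reply              # low recommendation: agent intervenes

auto = choose_reply([("Your refund is on its way.", 0.92)], "agent text")
manual = choose_reply([("Maybe try restarting?", 0.30)], "agent text")
```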
10. The audio-based man-machine mixed interaction method according to claim 6 or 8, wherein the key information comprises a dialogue domain and dialogue keywords, and the dialogue keywords include content keywords and emotion keywords.
CN201610791966.0A 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio Active CN106409283B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610791966.0A CN106409283B (en) 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio

Publications (2)

Publication Number Publication Date
CN106409283A true CN106409283A (en) 2017-02-15
CN106409283B CN106409283B (en) 2020-01-10

Family

ID=58001464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610791966.0A Active CN106409283B (en) 2016-08-31 2016-08-31 Man-machine mixed interaction system and method based on audio

Country Status (1)

Country Link
CN (1) CN106409283B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1920948A (en) * 2005-08-24 2007-02-28 富士通株式会社 Voice recognition system and voice processing system
CN101276584A (en) * 2007-03-28 2008-10-01 株式会社东芝 Prosody-pattern generating apparatus, speech synthesizing apparatus, and computer program product and method thereof
CN102509483A (en) * 2011-10-31 2012-06-20 苏州思必驰信息科技有限公司 Distributive automatic grading system for spoken language test and method thereof
CN102982799A (en) * 2012-12-20 2013-03-20 中国科学院自动化研究所 Speech recognition optimization decoding method integrating guide probability
CN104678868A (en) * 2015-01-23 2015-06-03 贾新勇 Business and equipment operation and maintenance monitoring system
CN105227790A (en) * 2015-09-24 2016-01-06 北京车音网科技有限公司 A kind of voice answer method, electronic equipment and system
CN105723362A (en) * 2013-10-28 2016-06-29 余自立 Natural expression processing method, processing and response method, device, and system

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204185B (en) * 2017-05-03 2021-05-25 深圳车盒子科技有限公司 Vehicle-mounted voice interaction method and system and computer readable storage medium
CN107122807A (en) * 2017-05-24 2017-09-01 努比亚技术有限公司 A kind of family's monitoring method, service end and computer-readable recording medium
CN107122807B (en) * 2017-05-24 2021-05-21 努比亚技术有限公司 Home monitoring method, server and computer readable storage medium
CN107733780B (en) * 2017-09-18 2020-07-03 上海量明科技发展有限公司 Intelligent task allocation method and device and instant messaging tool
CN107733780A (en) * 2017-09-18 2018-02-23 上海量明科技发展有限公司 Task smart allocation method, apparatus and JICQ
CN109697226A (en) * 2017-10-24 2019-04-30 上海易谷网络科技股份有限公司 Text silence seat monitoring robot interactive method
CN107992587A (en) * 2017-12-08 2018-05-04 北京百度网讯科技有限公司 A kind of voice interactive method of browser, device, terminal and storage medium
CN110069607A (en) * 2017-12-14 2019-07-30 株式会社日立制作所 For the method, apparatus of customer service, electronic equipment, computer readable storage medium
CN110069607B (en) * 2017-12-14 2024-03-05 株式会社日立制作所 Method, apparatus, electronic device, and computer-readable storage medium for customer service
WO2020057446A1 (en) * 2018-09-17 2020-03-26 Huawei Technologies Co., Ltd. Method and system for generating a semantic point cloud map
US10983526B2 (en) 2018-09-17 2021-04-20 Huawei Technologies Co., Ltd. Method and system for generating a semantic point cloud map
CN110970017A (en) * 2018-09-27 2020-04-07 北京京东尚科信息技术有限公司 Human-computer interaction method and system and computer system
CN111125384A (en) * 2018-11-01 2020-05-08 阿里巴巴集团控股有限公司 Multimedia answer generation method and device, terminal equipment and storage medium
CN111125384B (en) * 2018-11-01 2023-04-07 阿里巴巴集团控股有限公司 Multimedia answer generation method and device, terminal equipment and storage medium
CN110602334A (en) * 2019-09-03 2019-12-20 上海航动科技有限公司 Intelligent outbound method and system based on man-machine cooperation
CN110926493A (en) * 2019-12-10 2020-03-27 广州小鹏汽车科技有限公司 Navigation method, navigation device, vehicle and computer readable storage medium
CN111540353A (en) * 2020-04-16 2020-08-14 重庆农村商业银行股份有限公司 Semantic understanding method, device, equipment and storage medium
CN112509575A (en) * 2020-11-26 2021-03-16 上海济邦投资咨询有限公司 Financial consultation intelligent guiding system based on big data
CN112735427A (en) * 2020-12-25 2021-04-30 平安普惠企业管理有限公司 Radio reception control method and device, electronic equipment and storage medium
CN112735410A (en) * 2020-12-25 2021-04-30 中国人民解放军63892部队 Automatic voice interactive force model control method and system
CN112735427B (en) * 2020-12-25 2023-12-05 海菲曼(天津)科技有限公司 Radio reception control method and device, electronic equipment and storage medium
CN116453540A (en) * 2023-06-15 2023-07-18 山东贝宁电子科技开发有限公司 Underwater frogman voice communication quality enhancement processing method
CN116453540B (en) * 2023-06-15 2023-08-29 山东贝宁电子科技开发有限公司 Underwater frogman voice communication quality enhancement processing method

Also Published As

Publication number Publication date
CN106409283B (en) 2020-01-10

Similar Documents

Publication Publication Date Title
CN106409283A (en) Audio frequency-based man-machine mixed interaction system and method
CN110266899B (en) Client intention identification method and customer service system
US10911596B1 (en) Voice user interface for wired communications system
US10412206B1 (en) Communications for multi-mode device
CN101207586B (en) Method and system for real-time automatic communication
US20190332679A1 (en) Auto-translation for multi user audio and video
EP2411977B1 (en) Service oriented speech recognition for in-vehicle automated interaction
US20180054506A1 (en) Enabling voice control of telephone device
CN109961792B (en) Method and apparatus for recognizing speech
US20120253823A1 (en) Hybrid Dialog Speech Recognition for In-Vehicle Automated Interaction and In-Vehicle Interfaces Requiring Minimal Driver Processing
CN101207655A (en) Method and system switching between voice and text exchanging forms in a communication conversation
CN104010267A (en) Method and system for supporting a translation-based communication service and terminal supporting the service
US10194023B1 (en) Voice user interface for wired communications system
CN106230689A (en) Method, device and the server that a kind of voice messaging is mutual
US9390426B2 (en) Personalized advertisement device based on speech recognition SMS service, and personalized advertisement exposure method based on partial speech recognition SMS service
US10326886B1 (en) Enabling additional endpoints to connect to audio mixing device
CN110119514A (en) The instant translation method of information, device and system
CN112866086A (en) Information pushing method, device, equipment and storage medium for intelligent outbound
CN111554280A (en) Real-time interpretation service system for mixing interpretation contents using artificial intelligence and interpretation contents of interpretation experts
CN108881507B (en) System comprising voice browser and block chain voice DNS unit
CN116863935B (en) Speech recognition method, device, electronic equipment and computer readable medium
KR20090076318A (en) Realtime conversational service system and method thereof
CN110740212B (en) Call answering method and device based on intelligent voice technology and electronic equipment
US20020072916A1 (en) Distributed speech recognition for internet access
CN1427394A (en) Speech sound browsing network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200619

Address after: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee after: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

Address before: 200240 Dongchuan Road, Shanghai, No. 800, No.

Patentee before: SHANGHAI JIAO TONG University

TR01 Transfer of patent right

Effective date of registration: 20201105

Address after: 215123 14 Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou, Jiangsu.

Patentee after: AI SPEECH Ltd.

Address before: Room 105G, 199 GuoShoujing Road, Pudong New Area, Shanghai, 200120

Patentee before: Shanghai Jiaotong University Intellectual Property Management Co.,Ltd.

CP01 Change in the name or title of a patent holder

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Patentee before: AI SPEECH Ltd.