CN110491383A - Voice interaction method, device, system, storage medium and processor - Google Patents
Voice interaction method, device, system, storage medium and processor
- Publication number
- CN110491383A (application CN201910910484.6A)
- Authority
- CN
- China
- Prior art keywords
- target
- voice
- result
- speech recognition
- recognition result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/32—Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The invention discloses a voice interaction method, device, system, storage medium and processor. The method comprises: obtaining an input voice stream and distributing it to each speech recognition engine for speech recognition; selecting a target speech recognition result from the obtained speech recognition results; distributing the target speech recognition result to each natural language processing engine and selecting a target semantic processing result from the obtained semantic processing results; and replying to the input voice stream according to the target semantic processing result. In the above method, the target speech recognition result is filtered out of the speech recognition results and distributed to multiple natural language processing engines, and the target semantic processing result is selected from the obtained semantic processing results. This avoids the problem of handling the voice interaction process with a single ASR, NLP and TTS, which is quite limiting: if the ASR and/or NLP result is inaccurate, the voice interaction is affected.
Description
Technical field
The present invention relates to the field of human-computer interaction technology, and in particular to a voice interaction method, device, system, storage medium and processor.
Background technique
During voice interaction, a smart speaker collects the input voice data. After automatic speech recognition ASR (Automatic Speech Recognition), the recognized text is sent to natural language processing NLP (Natural Language Processing); after semantic understanding, speech is synthesized using text-to-speech TTS (Text To Speech) technology, returned to the device side and played back.

The existing voice interaction process handles the input voice stream with a single ASR, NLP and TTS, which is quite limiting: if the early ASR recognition is inaccurate, the NLP understanding is affected, and even if the ASR recognition is accurate, a poor NLP understanding will still affect the entire voice interaction process.
Summary of the invention
In view of this, the present invention provides a voice interaction method and device to solve the problem that the existing voice interaction process mostly uses a single ASR, NLP and TTS, which is quite limiting: if the early ASR recognition is inaccurate, the NLP understanding is affected, and if the NLP understanding is poor, the entire voice interaction process is likewise affected. The concrete scheme is as follows:
A voice interaction method, comprising:

obtaining an input voice stream, distributing the input voice stream to each target speech recognition engine for speech recognition, and obtaining each speech recognition result;

selecting a target speech recognition result from the speech recognition results;

distributing the target speech recognition result to each target natural language processing engine to obtain each semantic processing result;

selecting a target semantic processing result from the semantic processing results;

replying to the input voice stream according to the target semantic processing result.
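The claimed flow can be sketched in a few lines. The engine callables, dictionary fields and score names below are hypothetical illustrations, not part of the claims:

```python
# Minimal sketch of the claimed multi-engine voice interaction flow.
# Each ASR engine is assumed to return {"rate": ..., "text": ...} and each
# NLP engine {"confidence": ..., "reply": ...}; these shapes are invented here.

def interact(voice_stream, asr_engines, nlp_engines, tts):
    # Step 1: distribute the input stream to every target ASR engine.
    asr_results = [engine(voice_stream) for engine in asr_engines]
    # Step 2: keep the recognition result with the highest recognition rate.
    text = max(asr_results, key=lambda r: r["rate"])["text"]
    # Step 3: distribute the chosen text to every target NLP engine.
    nlp_results = [engine(text) for engine in nlp_engines]
    # Step 4: keep the semantic result with the highest confidence.
    best = max(nlp_results, key=lambda r: r["confidence"])
    # Step 5: reply to the input stream via speech synthesis.
    return tts(best["reply"])
```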
In the above method, optionally, selecting a target speech recognition result from the speech recognition results comprises: obtaining the recognition rate of each speech recognition result; and taking the recognition result with the highest recognition rate as the target recognition result.
In the above method, optionally, selecting a target semantic processing result from the semantic processing results comprises: obtaining the confidence of each semantic processing result; and taking the semantic processing result with the highest confidence as the target semantic processing result.
In the above method, optionally, replying to the input voice stream according to the target semantic processing result comprises: obtaining a target reply matching the target semantic processing result and determining the user group that produced the input voice stream; determining a target speech synthesis engine according to the user group; and converting the target reply into an output voice stream through the target speech synthesis engine.
In the above method, optionally, determining the user group that produced the input voice stream comprises: obtaining the type of the target speech recognition engine that produced the target speech recognition result and/or a face recognition result; and determining the user group according to the type and/or the face recognition result.
A voice interaction device, comprising:

an acquisition and recognition module, configured to obtain an input voice stream, distribute the input voice stream to each target speech recognition engine for speech recognition, and obtain each speech recognition result;

a speech recognition result selection module, configured to select a target speech recognition result from the speech recognition results;

a processing module, configured to distribute the target speech recognition result to each target natural language processing engine to obtain each semantic processing result;

a processing result selection module, configured to select a target semantic processing result from the semantic processing results;

a reply module, configured to reply to the input voice stream according to the target semantic processing result.
In the above device, optionally, the reply module comprises: an acquisition and determination unit, configured to obtain a target reply matching the target semantic processing result and determine the user group that produced the input voice stream; a determination unit, configured to determine a target speech synthesis engine according to the user group; and a conversion unit, configured to convert the target reply into an output voice stream through the target speech synthesis engine.
A voice interaction system, comprising: a cloud server, a speech recognition module, a semantic processing module, a skill module, a speech synthesis module and a smart voice terminal, wherein

the cloud server obtains the input voice stream collected by the smart voice terminal and distributes the input voice stream to the speech recognition module for speech recognition to obtain a target speech recognition result;

the speech recognition module sends the target speech recognition result to the cloud server, and the cloud server sends the target speech recognition result to the semantic processing module to obtain a target semantic processing result;

the semantic processing module sends the target semantic processing result to the cloud server, and the cloud server sends the target semantic processing result to the skill module to obtain a target reply;

the skill module sends the target reply to the cloud server, and the cloud server sends the target reply to the speech synthesis module to obtain an output voice stream;

the speech synthesis module sends the output voice stream to the cloud server, and the cloud server sends the output voice stream to the smart voice terminal for playback.
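The relay role of the cloud server in the claimed system can be sketched as a simple orchestrator. The module objects and method names below are hypothetical, not taken from the patent:

```python
# Sketch of the cloud server relaying messages between the claimed modules.
# The module interfaces (recognize/parse/execute/synthesize/play) are invented
# for illustration; the patent only specifies the message flow between them.

class CloudServer:
    def __init__(self, asr, nlp, skill, tts, terminal):
        self.asr, self.nlp, self.skill = asr, nlp, skill
        self.tts, self.terminal = tts, terminal

    def handle(self, voice_stream):
        text = self.asr.recognize(voice_stream)   # target speech recognition result
        semantics = self.nlp.parse(text)          # target semantic processing result
        reply = self.skill.execute(semantics)     # target reply
        audio = self.tts.synthesize(reply)        # output voice stream
        self.terminal.play(audio)                 # playback on the terminal
        return audio
```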
A storage medium, comprising a stored program, wherein the program executes the above voice interaction method.

A processor, configured to run a program, wherein the program, when running, executes the above voice interaction method.
Compared with the prior art, the present invention has the following advantages:

The invention discloses a voice interaction method, device, system, storage medium and processor. The method comprises: obtaining an input voice stream, distributing it to each speech recognition engine for speech recognition, and selecting a target speech recognition result from the obtained speech recognition results; distributing the target speech recognition result to each natural language processing engine and selecting a target semantic processing result from the obtained semantic processing results; and replying to the input voice stream according to the target semantic processing result. In the above method, the target speech recognition result is filtered out of the speech recognition results and distributed to multiple natural language processing engines, and the target semantic processing result is selected from the obtained semantic processing results, avoiding the problem that a voice interaction process using a single ASR, NLP and TTS is quite limiting and that an inaccurate ASR and/or NLP result affects the voice interaction.

Of course, any product implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Detailed description of the invention
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a voice interaction method disclosed in an embodiment of the present application;

Fig. 2 is another flowchart of a voice interaction method disclosed in an embodiment of the present application;

Fig. 3 is a structural block diagram of a voice interaction system disclosed in an embodiment of the present application;

Fig. 4 is a structural block diagram of a voice interaction device disclosed in an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a voice interaction method and device applied during the voice interaction process. In the existing voice interaction process, the input voice stream is handled by a single ASR, NLP and TTS; if the ASR speech recognition result and/or the NLP natural language processing result deviates significantly from the corresponding actual result, irrelevant answers may be given, affecting the voice interaction process. To solve this problem, the present invention provides a voice interaction method whose execution flow is shown in Fig. 1, comprising the steps of:
S101, obtaining an input voice stream, distributing the input voice stream to each target speech recognition engine for speech recognition, and obtaining each speech recognition result.

In an embodiment of the present invention, the input voice stream is obtained from a smart voice device, which may be a smart speaker, a smart voice robot, a smartphone, etc. The smart voice device collects the voice uttered by the user and converts it into the input voice stream; the input voice stream is distributed to each target speech recognition engine for recognition, and each speech recognition result is obtained.
Taking the distribution process as an example: if the system contains 10 speech recognition engines, the number of target speech recognition engines may be less than or equal to 10. For example, all 10 speech recognition engines may be taken as target speech recognition engines, i.e., the number of speech recognition engines equals the number of target speech recognition engines, and the input voice stream is distributed to all 10 target speech recognition engines for speech recognition. However, this approach places high demands on the processor; when the processor configuration cannot meet the requirements, speech recognition slows down, affecting the voice interaction process and degrading the user experience. Therefore, to improve the speed of speech recognition, the type of the input voice stream may be obtained before distribution to the speech recognition engines, and the 10 speech recognition engines above may be screened according to that type to obtain no fewer than two target speech recognition engines; the number of target speech recognition engines is then less than or equal to 10. The type may be divided according to the actual scenario or a vertical subdivision of fields. For example, the classification may be by language, by professional domain, or by other scenarios. Classification by language may be subdivided into Chinese and foreign languages; Chinese may be further subdivided into Mandarin and dialects, with further subdivision of dialects where appropriate, and the foreign languages may be English, Japanese, Korean, etc. Classification by professional domain may distinguish, for example, the computer field, the communications field or the machinery field, with further subdivision according to the concrete situation; details are not repeated here. Of course, other classification schemes may also be included; the embodiments of the present invention place no limit on the concrete form of the type.
S102, selecting a target speech recognition result from the speech recognition results.

In an embodiment of the present invention, each target speech recognition engine outputs, along with the recognition result corresponding to the input voice stream, the recognition rate of that result. The recognition rate may vary with factors such as the signal-to-noise ratio and whether recognition is performed online or offline; therefore, after obtaining the factors affecting the recognition rate, such as the signal-to-noise ratio of the input voice stream and whether the target speech recognition engine works online, the recognition rate of the input voice stream under the corresponding target speech recognition engine is determined.
In practice, the direct indicator of the recognition rate is generally the word error rate WER (Word Error Rate), defined as follows: in order to make the recognized word sequence consistent with the standard word sequence, certain words need to be substituted, deleted or inserted; the total number of these substituted, deleted and inserted words, divided by the total number of words in the standard word sequence and expressed as a percentage, is the WER.

The formulas are:

WER = (S + D + I) / N × 100% (1)

Accuracy = 100% − WER (2)

where:

S is the number of substituted words;
D is the number of deleted words;
I is the number of inserted words;
N is the total number of words in the standard sequence;
WER is the word error rate;
Accuracy is the recognition rate.

WER can be broken down by gender, speaking rate, accent, digits/English/Chinese and so on, and examined separately. Because of inserted words, WER can theoretically exceed 100%; in practice, particularly with a large sample size, this should not happen, as such a system would be far too poor to be commercially usable.
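The WER defined by formula (1) is conventionally computed with edit-distance dynamic programming over word sequences. This is a generic sketch of that standard computation, not code from the patent:

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N * 100, via Levenshtein distance on words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum substitutions + deletions + insertions needed to turn
    # the first i reference words into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)
```

A hypothesis with many extra words yields WER above 100%, matching the remark that insertions make WER theoretically unbounded.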
Further, the sentence error rate SER (Sentence Error Rate) could be used, i.e., the number of incorrectly recognized sentences divided by the total number of sentences. In practice, however, the sentence error rate is generally 2 to 3 times the word error rate, so it is not usually used to measure the recognition process.

In the embodiments of the present invention, the recognition rate is used as the reference: the recognition rate of each speech recognition result is first calculated, and the speech recognition result with the highest recognition rate is taken as the target speech recognition result.
S103, distributing the target speech recognition result to each target natural language processing engine to obtain each semantic processing result.

In an embodiment of the present invention, the target speech recognition result is distributed to each target natural language processing engine. Taking the distribution process as an example: if the system contains 10 natural language processing engines, the number of target natural language processing engines is less than or equal to 10. For example, all 10 natural language processing engines may be taken as target natural language processing engines, i.e., the number of target natural language processing engines equals the number of natural language processing engines. However, this approach places high demands on the processor; when the processor configuration cannot meet the requirements, processing slows down, affecting the voice interaction process and degrading the user experience. Therefore, to improve the speed of voice interaction, the class of the target recognition result may be determined before it is distributed to the target natural language processing engines. The class may be determined according to the actual scenario or a vertical subdivision of fields; for example, the classification may be by language, by professional domain, or by other scenarios. Classification by language may be subdivided into Chinese and foreign languages; Chinese may be further subdivided into Mandarin and dialects, with further subdivision of dialects where appropriate, and the foreign languages may be English, Japanese, Korean, etc. Classification by professional domain may distinguish, for example, the computer field, the communications field or the machinery field, with further subdivision according to the concrete situation; details are not repeated here. Of course, other classification schemes may also be included; the embodiments of the present invention place no limit on the concrete form of the classification. Preferably, there is a correspondence between the classifications of the target speech recognition engines and those of the target natural language processing engines. For example, if the target speech recognition result was obtained through a target speech recognition engine for a dialect, it can be distributed directly to the target natural language processing engine for that dialect.
S104, selecting a target semantic processing result from the semantic processing results.

In an embodiment of the present invention, each target natural language processing engine outputs, along with the target semantic processing result corresponding to the target speech recognition result, the confidence of that semantic processing result. Taking Baidu's NLP semantic computation framework as an example of a target natural language processing engine: it consists mainly of three parts, with the bottom layer relying on big data (web data and user behavior data) and high-performance computing clusters (GPU, CPU and FPGA), on which a target natural language processing engine based on DNNs and probabilistic graphical models is built. By feeding the target speech recognition result into the target natural language processing engine, a target semantic processing result can be obtained, where the target semantic processing result is a textual reply to the input voice stream. On the basis of the semantic processing result, semantic-level computations are then performed, including semantic matching, semantic retrieval, text classification, sequence generation and sequence labeling, so as to determine the confidence of the semantic processing result. Since different target natural language processing engines determine confidence in different ways, the confidences may not be directly comparable; they are therefore compared after normalization or other processing, and the semantic processing result with the highest confidence is taken as the target semantic processing result.
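The cross-engine comparison can be sketched as follows. The patent does not fix a normalization scheme ("normalization or other processing"), so per-engine min-max scaling over an assumed confidence range is only one illustrative choice, and the data shapes are invented:

```python
# Illustrative selection of the target semantic result across NLP engines
# whose raw confidences are not directly comparable. Min-max scaling per
# engine (over an assumed historical range) is a hypothetical choice.

def pick_semantic_result(results, ranges):
    # results: list of (engine_id, raw_confidence, parse)
    # ranges:  engine_id -> (lowest, highest) confidence seen for that engine
    def normalized(engine_id, confidence):
        lo, hi = ranges[engine_id]
        return (confidence - lo) / (hi - lo) if hi > lo else 0.0
    return max(results, key=lambda r: normalized(r[0], r[1]))
```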
S105, replying to the input voice stream according to the target semantic processing result.

In an embodiment of the present invention, text-to-speech TTS (Text-To-Speech) technology is used to convert the textual target semantic processing result into an output voice stream, which is read aloud by the smart voice device, analogous to a human mouth. For example, the voice heard from voice assistants such as Siri is generated by TTS.
The invention discloses a voice interaction method comprising: obtaining an input voice stream, distributing it to each speech recognition engine for speech recognition, and selecting a target speech recognition result from the obtained speech recognition results; distributing the target speech recognition result to each natural language processing engine and selecting a target semantic processing result from the obtained semantic processing results; and replying to the input voice stream according to the target semantic processing result. In the above method, the target speech recognition result is filtered out of the speech recognition results and distributed to multiple natural language processing engines, and the target semantic processing result is selected from the obtained semantic processing results, avoiding the problem that a voice interaction process using a single ASR, NLP and TTS is quite limiting and that inaccurate ASR and/or NLP recognition affects the voice interaction.
In an embodiment of the present invention, the process of replying to the input voice stream according to the target semantic processing result is shown in Fig. 2, comprising the steps of:

S201, obtaining a target reply matching the target semantic processing result and determining the user group that produced the input voice stream.

In an embodiment of the present invention, keywords in the target semantic processing result are obtained, the skill unit corresponding to the target semantic processing result is determined according to the keywords, and the target reply produced by processing the target voice is received from that skill unit. The type of the target speech recognition engine that produced the target speech recognition result and/or a face recognition result is obtained, and the user group that produced the input voice stream is determined according to the type and/or the face recognition result. The user group may be men or women, old or young, family members, or voice senders who use a certain dialect or language, etc.
S202, determining a target speech synthesis engine according to the user group.

In an embodiment of the present invention, the selection of the speech synthesis engine may also be divided according to the actual scenario or a vertical subdivision of fields, and the target speech synthesis engine is determined according to the target group. For example, the target speech synthesis engines may be classified by language into Chinese and foreign languages; Chinese may be further subdivided into Mandarin and dialects, with further subdivision of dialects where appropriate, and the foreign languages may be English, Japanese, Korean, etc. The embodiments of the present invention place no limit on the concrete form of the classification. For example, if the user group consists of speakers of a dialect, the target speech recognition engine may be one corresponding to that dialect type, and the speech synthesis engine corresponding to the dialect type can then be selected directly as the target speech synthesis engine.
S203, converting the target reply into an output voice stream through the target speech synthesis engine.

In an embodiment of the present invention, the target reply is converted into an output voice stream through the target speech synthesis engine; different types of target speech synthesis engine reply in different ways. The target speech synthesis engine may also rely on face recognition technology to identify a user profile. For example, the smart voice terminal recognizes through face recognition that the received input voice stream is a mother's words, and learns through the history or a configured reply rule that the mother most wants to hear her son's voice; in this case, the target speech synthesis engine can send the target reply to the smart voice terminal in the son's voice. Of course, depending on the concrete situation, the target reply may also be sent to the smart voice terminal in English, in a dialect or in another manner.
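The user-group-to-voice mapping described above can be sketched as a lookup table. Every table entry is a hypothetical illustration (including the mother-hears-son's-voice case from the example), not an enumeration from the patent:

```python
# Hypothetical mapping from an inferred user group and dialect/language to a
# synthesis engine identifier; all entries are illustrative, not claimed.

VOICE_TABLE = {
    ("mother", "mandarin"): "son_voice_engine",
    ("child", "mandarin"): "storyteller_engine",
}

def choose_tts_engine(user_group, language, default="standard_engine"):
    # Fall back to a standard voice when no rule matches the user group.
    return VOICE_TABLE.get((user_group, language), default)
```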
Based on a kind of above-mentioned voice interactive method, a kind of voice interactive system is provided in the embodiment of the present invention, it is described
The structural block diagram of interactive system is as shown in Figure 3, comprising: Cloud Server 301, speech recognition module 302, semantic processes module 303,
Technical ability module 304, voice synthetic module 305 and intelligent sound terminal 306, wherein
The Cloud Server 301 is used to obtain the input voice flow that the intelligent sound terminal 306 acquires, by the input
Voice flow is distributed to the speech recognition module 302 and carries out speech recognition, obtains target voice recognition result;
In the embodiment of the present invention, the speech recognition module 302 includes multiple speech recognition engines. Preferably, to improve recognition efficiency, the multiple speech recognition engines can first be screened during speech recognition to obtain multiple target speech recognition engines; speech recognition is then performed by these target speech recognition engines, and among the resulting speech recognition results the one with the highest recognition rate is chosen as the target speech recognition result.
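The selection step above — fanning the input voice stream out to several target recognition engines and keeping the result with the highest recognition rate — can be sketched as follows; the stub engine class and its canned results are illustrative assumptions, not the patent's implementation:

```python
class StubASREngine:
    """Hypothetical stand-in for one speech recognition engine; a real
    engine would decode the audio instead of returning a canned result."""
    def __init__(self, name, text, score):
        self.name, self.text, self.score = name, text, score

    def recognize(self, audio):
        # Return a (transcript, recognition-rate) pair for the voice stream.
        return (self.text, self.score)

def pick_target_recognition(engines, audio):
    # Fan the input voice stream out to every target engine and keep the
    # result with the highest recognition rate.
    results = [engine.recognize(audio) for engine in engines]
    return max(results, key=lambda result: result[1])

engines = [
    StubASREngine("mandarin", "open the living-room air conditioner", 0.93),
    StubASREngine("cantonese", "often the living root air condition", 0.41),
]
text, score = pick_target_recognition(engines, audio=b"...")
```

The transcript kept here plays the role of the target speech recognition result in the flow above.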
The speech recognition module 302 sends the target speech recognition result to the cloud server 301, and the cloud server 301 passes the target speech recognition result to the semantic processing module 303 to obtain a target semantic processing result.
In the embodiment of the present invention, the semantic processing module 303 includes multiple natural language processing engines. Preferably, to improve processing efficiency, the multiple natural language processing engines can be screened during natural language processing to obtain multiple target natural language processing engines; the target speech recognition result is sent to these target natural language processing engines, and among the resulting semantic processing results the one with the highest confidence is chosen as the target semantic processing result.
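The natural-language step can be sketched the same way, this time dispatching to the target engines in parallel and keeping the most confident parse; the engines below are hypothetical stand-ins that return a parse plus a confidence score:

```python
from concurrent.futures import ThreadPoolExecutor

def pick_target_semantics(nlp_engines, recognition_text):
    # Distribute the target speech recognition result to every target NLP
    # engine in parallel, then keep the parse with the highest confidence.
    with ThreadPoolExecutor(max_workers=len(nlp_engines)) as pool:
        parses = list(pool.map(lambda engine: engine(recognition_text), nlp_engines))
    return max(parses, key=lambda p: p["confidence"])

# Hypothetical engines: each returns its parse plus a confidence score.
engine_a = lambda text: {"domain": "aircon", "intent": "open", "confidence": 0.88}
engine_b = lambda text: {"domain": "music", "intent": "play", "confidence": 0.12}

best = pick_target_semantics([engine_a, engine_b], "open the living-room air conditioner")
```

Parallel dispatch is one plausible reading of "distributed to each target natural language processing engine"; sequential dispatch would select the same result.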
The semantic processing module 303 sends the target semantic processing result to the cloud server 301, and the cloud server 301 sends the target semantic processing result to the skill module 304 to obtain a target reply.
In the embodiment of the present invention, the skill module 304 processes the target semantic processing result according to the specific situation: if a reply must be returned to the intelligent voice terminal 306, the returned result is the target reply; if the result is a control instruction, processing continues within the skill module 304. The embodiment of the present invention is illustrated for the case where the returned result is a target reply. For example, the user says "open the living-room air conditioner"; the target speech recognition result is exactly "open the living-room air conditioner", which after natural language understanding is translated into "the domain is air conditioning, the instruction is open, and the specific location is the living room". According to the domain, the cloud server 301 distributes the result to the skill corresponding to air conditioning within the skill module 304; following the instruction and the location, the air-conditioning skill opens the living-room air conditioner through a control command and, on success, returns a target reply such as "OK, the living-room air conditioner is now on."
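The domain-based dispatch in the example above can be sketched as a skill registry; the field names, the skill, and the reply wording are illustrative assumptions, not the patent's actual interface:

```python
def aircon_skill(result):
    # A real skill would also issue the control command to the device.
    if result["intent"] == "open":
        return f"OK, the {result['slot']} air conditioner is now on."
    return None

SKILLS = {"aircon": aircon_skill}  # one entry per domain

def dispatch(semantic_result):
    # Route the target semantic processing result to the skill registered
    # for its domain; the skill returns the target reply.
    return SKILLS[semantic_result["domain"]](semantic_result)

reply = dispatch({"domain": "aircon", "intent": "open", "slot": "living room"})
```

The returned string corresponds to the target reply that the skill module hands back to the cloud server.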
The skill module 304 sends the target reply to the cloud server 301, and the cloud server 301 sends the target reply to the speech synthesis module 305 to obtain an output voice stream.
The speech synthesis module 305 sends the output voice stream to the cloud server 301, and the cloud server 301 sends the output voice stream to the intelligent voice terminal 306 for playback.
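Tying the stages together, the cloud-side orchestration of Figure 3 can be compressed into one hedged sketch; the stub engines and skill below stand in for modules 302-305 and are assumptions for illustration:

```python
def voice_interaction(audio, asr_engines, nlp_engines, skills):
    # Cloud-side orchestration of Figure 3: fan the audio out to the ASR
    # engines, keep the best transcript, fan that out to the NLP engines,
    # keep the most confident parse, then let the matching skill reply.
    text, _ = max((asr(audio) for asr in asr_engines), key=lambda r: r[1])
    parse = max((nlp(text) for nlp in nlp_engines), key=lambda p: p["confidence"])
    return skills[parse["domain"]](parse)

# Hypothetical stub engines standing in for the system's modules:
asr_engines = [lambda a: ("open the living-room air conditioner", 0.9)]
nlp_engines = [lambda t: {"domain": "aircon", "confidence": 0.8}]
skills = {"aircon": lambda p: "OK, the living-room air conditioner is on."}
reply = voice_interaction(b"...", asr_engines, nlp_engines, skills)
```

In the system itself each hop passes through the cloud server 301 and the reply is then synthesized into the output voice stream; the sketch keeps only the selection logic.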
Based on the above voice interaction method, an embodiment of the present invention provides a voice interaction device whose structural block diagram is shown in Figure 4, comprising:
an acquisition and recognition module 401, a speech recognition result selection module 402, a processing module 403, a processing result selection module 404 and a reply module 405, wherein:
the acquisition and recognition module 401 is configured to obtain an input voice stream and distribute the input voice stream to each target speech recognition engine for speech recognition, obtaining each speech recognition result;
the speech recognition result selection module 402 is configured to choose a target speech recognition result among the speech recognition results;
the processing module 403 is configured to distribute the target speech recognition result to each target natural language processing engine, obtaining each semantic processing result;
the processing result selection module 404 is configured to choose a target semantic processing result among the semantic processing results;
the reply module 405 is configured to reply to the input voice stream according to the target semantic processing result.
The invention discloses a voice interaction device that: obtains an input voice stream and distributes it to each speech recognition engine for speech recognition, choosing a target speech recognition result among the obtained speech recognition results; distributes the target speech recognition result to each natural language processing engine, choosing a target semantic processing result among the obtained semantic processing results; and replies to the input voice stream according to the target semantic processing result. By screening a target speech recognition result out of multiple speech recognition results, distributing it to multiple natural language processing engines and choosing a target semantic processing result among the obtained semantic processing results, the device avoids the limitations of a voice interaction flow handled by a single ASR, NLP and TTS, where inaccurate ASR and/or NLP recognition would affect the voice interaction.
In the embodiment of the present invention, the reply module 405 includes:
an acquisition and determination unit 406, a determination unit 407 and a conversion unit 408, wherein:
the acquisition and determination unit 406 is configured to obtain a target reply matching the target semantic processing result and to determine the user group that produced the input voice stream;
the determination unit 407 is configured to determine a target speech synthesis engine according to the user group;
the conversion unit 408 is configured to convert the target reply into an output voice stream through the target speech synthesis engine.
The voice interaction device includes a processor and a memory. The above acquisition and recognition module, speech recognition result selection module, processing module, processing result selection module, reply module and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which fetches the corresponding program unit from the memory; one or more kernels can be set. The kernel screens a target speech recognition result out of the speech recognition results, distributes the target speech recognition result to multiple natural language processing engines and chooses a target semantic processing result among the semantic processing results, avoiding the limitations of a voice interaction flow handled by a single ASR, NLP and TTS, where inaccurate ASR and/or NLP recognition would affect the whole voice interaction flow.
The memory may include non-volatile memory in a computer-readable medium, random access memory (RAM) and/or other forms such as non-volatile memory, e.g. read-only memory (ROM) or flash RAM; the memory includes at least one memory chip.
An embodiment of the invention provides a storage medium on which a program is stored; when the program is executed by a processor, the voice interaction method is realized.
An embodiment of the invention provides a processor configured to run a program, wherein the voice interaction method is executed when the program runs.
An embodiment of the invention provides a device that includes a processor, a memory and a program stored on the memory and runnable on the processor; when executing the program, the processor performs the following steps:
obtaining an input voice stream and distributing the input voice stream to each target speech recognition engine for speech recognition, obtaining each speech recognition result;
choosing a target speech recognition result among the speech recognition results;
distributing the target speech recognition result to each target natural language processing engine, obtaining each semantic processing result;
choosing a target semantic processing result among the semantic processing results;
replying to the input voice stream according to the target semantic processing result.
Optionally, in the above method, choosing a target speech recognition result among the speech recognition results comprises:
obtaining the recognition rate of each speech recognition result;
taking the recognition result with the highest recognition rate as the target recognition result.
Optionally, in the above method, choosing a target semantic processing result among the semantic processing results comprises:
obtaining the confidence of each semantic processing result;
taking the semantic processing result with the highest confidence as the target semantic processing result.
Optionally, in the above method, replying to the input voice stream according to the target semantic processing result comprises:
obtaining a target reply matching the target semantic processing result and determining the user group that produced the input voice stream;
determining a target speech synthesis engine according to the user group;
converting the target reply into an output voice stream through the target speech synthesis engine.
Optionally, in the above method, determining the user group that produced the input voice stream comprises:
obtaining the type of the target speech recognition engine that recognized the target speech recognition result and/or a face recognition result;
determining the user group according to the type and/or the face recognition result.
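The user-group determination from the winning engine's type and/or a face recognition result can be sketched as below; the label scheme (a `dialect:` prefix on engine types, the `"mother"` face label) is an illustrative assumption:

```python
def determine_user_group(engine_type, face_result=None):
    # Combine the type of the target speech recognition engine (e.g. a
    # dialect engine) with an optional face recognition result to label
    # the user group; the labels here are illustrative assumptions.
    if face_result == "mother":
        return "mother"
    if engine_type.startswith("dialect:"):
        return engine_type.split(":", 1)[1] + "_speaker"
    return "default"

group = determine_user_group("dialect:cantonese")
```

The resulting group label is what the method then maps to a target speech synthesis engine.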
The device herein may be a server, a PC, a PAD, a mobile phone, and so on.
The present invention also provides a computer program product adapted, when executed on a data processing device, to carry out a program with the following method steps:
obtaining an input voice stream and distributing the input voice stream to each target speech recognition engine for speech recognition, obtaining each speech recognition result;
choosing a target speech recognition result among the speech recognition results;
distributing the target speech recognition result to each target natural language processing engine, obtaining each semantic processing result;
choosing a target semantic processing result among the semantic processing results;
replying to the input voice stream according to the target semantic processing result.
Optionally, in the above method, choosing a target speech recognition result among the speech recognition results comprises:
obtaining the recognition rate of each speech recognition result;
taking the recognition result with the highest recognition rate as the target recognition result.
Optionally, in the above method, choosing a target semantic processing result among the semantic processing results comprises:
obtaining the confidence of each semantic processing result;
taking the semantic processing result with the highest confidence as the target semantic processing result.
Optionally, in the above method, replying to the input voice stream according to the target semantic processing result comprises:
obtaining a target reply matching the target semantic processing result and determining the user group that produced the input voice stream;
determining a target speech synthesis engine according to the user group;
converting the target reply into an output voice stream through the target speech synthesis engine.
Optionally, in the above method, determining the user group that produced the input voice stream comprises:
obtaining the type of the target speech recognition engine that recognized the target speech recognition result and/or a face recognition result;
determining the user group according to the type and/or the face recognition result.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to each other. The device-class embodiments are described relatively simply since they are basically similar to the method embodiments; for related details, see the corresponding parts of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
For convenience of description, the above device is described as various units divided by function. Of course, when implementing the present invention, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be realized by means of software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; this computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The voice interaction method, device, system, storage medium and processor provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the above embodiments are merely intended to help understand the method of the invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the invention. In conclusion, the content of this specification should not be construed as limiting the invention.
Claims (10)
1. A voice interaction method, characterized by comprising:
obtaining an input voice stream and distributing the input voice stream to each target speech recognition engine for speech recognition, obtaining each speech recognition result;
choosing a target speech recognition result among the speech recognition results;
distributing the target speech recognition result to each target natural language processing engine, obtaining each semantic processing result;
choosing a target semantic processing result among the semantic processing results;
replying to the input voice stream according to the target semantic processing result.
2. The method according to claim 1, characterized in that choosing a target speech recognition result among the speech recognition results comprises:
obtaining the recognition rate of each speech recognition result;
taking the recognition result with the highest recognition rate as the target recognition result.
3. The method according to claim 1, characterized in that choosing a target semantic processing result among the semantic processing results comprises:
obtaining the confidence of each semantic processing result;
taking the semantic processing result with the highest confidence as the target semantic processing result.
4. The method according to claim 1, characterized in that replying to the input voice stream according to the target semantic processing result comprises:
obtaining a target reply matching the target semantic processing result and determining the user group that produced the input voice stream;
determining a target speech synthesis engine according to the user group;
converting the target reply into an output voice stream through the target speech synthesis engine.
5. The method according to claim 4, characterized in that determining the user group that produced the input voice stream comprises:
obtaining the type of the target speech recognition engine that recognized the target speech recognition result and/or a face recognition result;
determining the user group according to the type and/or the face recognition result.
6. A voice interaction device, characterized by comprising:
an acquisition and recognition module, configured to obtain an input voice stream and distribute the input voice stream to each target speech recognition engine for speech recognition, obtaining each speech recognition result;
a speech recognition result selection module, configured to choose a target speech recognition result among the speech recognition results;
a processing module, configured to distribute the target speech recognition result to each target natural language processing engine, obtaining each semantic processing result;
a processing result selection module, configured to choose a target semantic processing result among the semantic processing results;
a reply module, configured to reply to the input voice stream according to the target semantic processing result.
7. The device according to claim 6, characterized in that the reply module comprises:
an acquisition and determination unit, configured to obtain a target reply matching the target semantic processing result and determine the user group that produced the input voice stream;
a determination unit, configured to determine a target speech synthesis engine according to the user group;
a conversion unit, configured to convert the target reply into an output voice stream through the target speech synthesis engine.
8. A voice interaction system, characterized by comprising: a cloud server, a speech recognition module, a semantic processing module, a skill module, a speech synthesis module and an intelligent voice terminal, wherein:
the cloud server is configured to obtain the input voice stream collected by the intelligent voice terminal and distribute the input voice stream to the speech recognition module for speech recognition, obtaining a target speech recognition result;
the speech recognition module sends the target speech recognition result to the cloud server, and the cloud server passes the target speech recognition result to the semantic processing module, obtaining a target semantic processing result;
the semantic processing module sends the target semantic processing result to the cloud server, and the cloud server sends the target semantic processing result to the skill module, obtaining a target reply;
the skill module sends the target reply to the cloud server, and the cloud server sends the target reply to the speech synthesis module, obtaining an output voice stream;
the speech synthesis module sends the output voice stream to the cloud server, and the cloud server sends the output voice stream to the intelligent voice terminal for playback.
9. A storage medium, characterized in that the storage medium includes a stored program, wherein the program executes the voice interaction method according to any one of claims 1 to 5.
10. A processor, characterized in that the processor is configured to run a program, wherein the voice interaction method according to any one of claims 1 to 5 is executed when the program runs.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910910484.6A CN110491383B (en) | 2019-09-25 | 2019-09-25 | Voice interaction method, device and system, storage medium and processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110491383A true CN110491383A (en) | 2019-11-22 |
CN110491383B CN110491383B (en) | 2022-02-18 |
Family
ID=68544152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910910484.6A Active CN110491383B (en) | 2019-09-25 | 2019-09-25 | Voice interaction method, device and system, storage medium and processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110491383B (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111798848A (en) * | 2020-06-30 | 2020-10-20 | 联想(北京)有限公司 | Voice synchronous output method and device and electronic equipment |
CN111862949A (en) * | 2020-07-30 | 2020-10-30 | 北京小米松果电子有限公司 | Natural language processing method and device, electronic equipment and storage medium |
CN111883122A (en) * | 2020-07-22 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Voice recognition method and device, storage medium and electronic equipment |
CN112003991A (en) * | 2020-09-02 | 2020-11-27 | 深圳壹账通智能科技有限公司 | Outbound method and related equipment |
CN112509565A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112614490A (en) * | 2020-12-09 | 2021-04-06 | 北京罗克维尔斯科技有限公司 | Method, device, medium, equipment, system and vehicle for generating voice instruction |
CN112820295A (en) * | 2020-12-29 | 2021-05-18 | 华人运通(上海)云计算科技有限公司 | Voice processing device and system, cloud server and vehicle |
CN112861542A (en) * | 2020-12-31 | 2021-05-28 | 思必驰科技股份有限公司 | Method and device for limiting scene voice interaction |
CN112992151A (en) * | 2021-03-15 | 2021-06-18 | 中国平安财产保险股份有限公司 | Speech recognition method, system, device and readable storage medium |
CN113077793A (en) * | 2021-03-24 | 2021-07-06 | 北京儒博科技有限公司 | Voice recognition method, device, equipment and storage medium |
WO2021135548A1 (en) * | 2020-06-05 | 2021-07-08 | 平安科技(深圳)有限公司 | Voice intent recognition method and device, computer equipment and storage medium |
CN113506565A (en) * | 2021-07-12 | 2021-10-15 | 北京捷通华声科技股份有限公司 | Speech recognition method, speech recognition device, computer-readable storage medium and processor |
CN114446279A (en) * | 2022-02-18 | 2022-05-06 | 青岛海尔科技有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
CN114464179A (en) * | 2022-01-28 | 2022-05-10 | 达闼机器人股份有限公司 | Voice interaction method, system, device, equipment and storage medium |
WO2022262542A1 (en) * | 2021-06-15 | 2022-12-22 | 南京硅基智能科技有限公司 | Text output method and system, storage medium, and electronic device |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5991719A (en) * | 1998-04-27 | 1999-11-23 | Fujistu Limited | Semantic recognition system |
CN101354886A (en) * | 2007-07-27 | 2009-01-28 | 陈修志 | Apparatus for recognizing speech |
US20090258333A1 (en) * | 2008-03-17 | 2009-10-15 | Kai Yu | Spoken language learning systems |
CN101923854A (en) * | 2010-08-31 | 2010-12-22 | 中国科学院计算技术研究所 | Interactive speech recognition system and method |
CN105096953A (en) * | 2015-08-11 | 2015-11-25 | 东莞市凡豆信息科技有限公司 | Voice recognition method capable of realizing multi-language mixed use |
CN106373569A (en) * | 2016-09-06 | 2017-02-01 | 北京地平线机器人技术研发有限公司 | Voice interaction apparatus and method |
CN106648082A (en) * | 2016-12-09 | 2017-05-10 | 厦门快商通科技股份有限公司 | Intelligent service device capable of simulating human interactions and method |
CN107093425A (en) * | 2017-03-30 | 2017-08-25 | 安徽继远软件有限公司 | Speech guide system, audio recognition method and the voice interactive method of power system |
CN107170446A (en) * | 2017-05-19 | 2017-09-15 | 深圳市优必选科技有限公司 | Semantic processes server and the method for semantic processes |
US10049656B1 (en) * | 2013-09-20 | 2018-08-14 | Amazon Technologies, Inc. | Generation of predictive natural language processing models |
CN208284230U (en) * | 2018-04-20 | 2018-12-25 | 贵州小爱机器人科技有限公司 | A kind of speech recognition equipment, speech recognition system and smart machine |
CN109545197A (en) * | 2019-01-02 | 2019-03-29 | 珠海格力电器股份有限公司 | Voice instruction identification method and device and intelligent terminal |
US20190102378A1 (en) * | 2017-09-29 | 2019-04-04 | Apple Inc. | Rule-based natural language processing |
CN109727597A (en) * | 2019-01-08 | 2019-05-07 | 未来电视有限公司 | The interaction householder method and device of voice messaging |
CN109791767A (en) * | 2016-09-30 | 2019-05-21 | 罗伯特·博世有限公司 | System and method for speech recognition |
Non-Patent Citations (2)
Title |
---|
P VANAJAKSHI ET AL.: "A Detailed Survey on Large Vocabulary Continuous Speech Recognition Techniques", 《ICCCI 2017》 * |
LIU YUE ET AL.: "Application and Development of Speech Recognition Technology in the Vehicle-Mounted Field", 《控制与信息技术》 (CONTROL AND INFORMATION TECHNOLOGY) * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021135548A1 (en) * | 2020-06-05 | 2021-07-08 | 平安科技(深圳)有限公司 | Voice intent recognition method and device, computer equipment and storage medium |
CN111798848B (en) * | 2020-06-30 | 2024-05-31 | 联想(北京)有限公司 | Voice synchronous output method and device and electronic equipment |
CN111798848A (en) * | 2020-06-30 | 2020-10-20 | 联想(北京)有限公司 | Voice synchronous output method and device and electronic equipment |
CN111883122A (en) * | 2020-07-22 | 2020-11-03 | 海尔优家智能科技(北京)有限公司 | Voice recognition method and device, storage medium and electronic equipment |
CN111883122B (en) * | 2020-07-22 | 2023-10-27 | 海尔优家智能科技(北京)有限公司 | Speech recognition method and device, storage medium and electronic equipment |
CN111862949A (en) * | 2020-07-30 | 2020-10-30 | 北京小米松果电子有限公司 | Natural language processing method and device, electronic equipment and storage medium |
CN111862949B (en) * | 2020-07-30 | 2024-04-02 | 北京小米松果电子有限公司 | Natural language processing method and device, electronic equipment and storage medium |
CN112003991A (en) * | 2020-09-02 | 2020-11-27 | 深圳壹账通智能科技有限公司 | Outbound method and related equipment |
CN112509565A (en) * | 2020-11-13 | 2021-03-16 | 中信银行股份有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112614490B (en) * | 2020-12-09 | 2024-04-16 | 北京罗克维尔斯科技有限公司 | Method, device, medium, equipment, system and vehicle for generating voice instruction |
CN112614490A (en) * | 2020-12-09 | 2021-04-06 | 北京罗克维尔斯科技有限公司 | Method, device, medium, equipment, system and vehicle for generating voice instruction |
CN112820295A (en) * | 2020-12-29 | 2021-05-18 | 华人运通(上海)云计算科技有限公司 | Voice processing device and system, cloud server and vehicle |
CN112820295B (en) * | 2020-12-29 | 2022-12-23 | 华人运通(上海)云计算科技有限公司 | Voice processing device and system, cloud server and vehicle |
CN112861542A (en) * | 2020-12-31 | 2021-05-28 | 思必驰科技股份有限公司 | Method and device for limiting scene voice interaction |
CN112861542B (en) * | 2020-12-31 | 2023-05-26 | 思必驰科技股份有限公司 | Method and device for voice interaction in limited scene |
CN112992151A (en) * | 2021-03-15 | 2021-06-18 | 中国平安财产保险股份有限公司 | Speech recognition method, system, device and readable storage medium |
CN112992151B (en) * | 2021-03-15 | 2023-11-07 | 中国平安财产保险股份有限公司 | Speech recognition method, system, device and readable storage medium |
CN113077793B (en) * | 2021-03-24 | 2023-06-13 | 北京如布科技有限公司 | Voice recognition method, device, equipment and storage medium |
CN113077793A (en) * | 2021-03-24 | 2021-07-06 | 北京儒博科技有限公司 | Voice recognition method, device, equipment and storage medium |
US11651139B2 (en) | 2021-06-15 | 2023-05-16 | Nanjing Silicon Intelligence Technology Co., Ltd. | Text output method and system, storage medium, and electronic device |
WO2022262542A1 (en) * | 2021-06-15 | 2022-12-22 | 南京硅基智能科技有限公司 | Text output method and system, storage medium, and electronic device |
CN113506565A (en) * | 2021-07-12 | 2021-10-15 | 北京捷通华声科技股份有限公司 | Speech recognition method, speech recognition device, computer-readable storage medium and processor |
CN113506565B (en) * | 2021-07-12 | 2024-06-04 | 北京捷通华声科技股份有限公司 | Speech recognition method, device, computer readable storage medium and processor |
WO2023143439A1 (en) * | 2022-01-28 | 2023-08-03 | 达闼机器人股份有限公司 | Speech interaction method, system and apparatus, and device and storage medium |
CN114464179A (en) * | 2022-01-28 | 2022-05-10 | 达闼机器人股份有限公司 | Voice interaction method, system, device, equipment and storage medium |
CN114464179B (en) * | 2022-01-28 | 2024-03-19 | 达闼机器人股份有限公司 | Voice interaction method, system, device, equipment and storage medium |
CN114446279A (en) * | 2022-02-18 | 2022-05-06 | 青岛海尔科技有限公司 | Voice recognition method, voice recognition device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110491383B (en) | 2022-02-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110491383A (en) | A kind of voice interactive method, device, system, storage medium and processor | |
CN106776936B (en) | Intelligent interaction method and system | |
CN103345467B (en) | Speech translation system | |
JP2021018797A (en) | Conversation interaction method, apparatus, computer readable storage medium, and program | |
CN103456314B (en) | Emotion recognition method and device | |
CN110148416A (en) | Speech recognition method, device, equipment and storage medium | |
WO2019084810A1 (en) | Information processing method and terminal, and computer storage medium | |
CN110459222A (en) | Voice control method, voice control device and terminal device | |
WO2021114841A1 (en) | User report generating method and terminal device | |
CN107591155A (en) | Speech recognition method and device, terminal and computer-readable storage medium | |
US10108707B1 (en) | Data ingestion pipeline | |
CN108932945A (en) | Voice instruction processing method and device | |
CN108804609A (en) | Song recommendation method and device | |
CN108735201A (en) | Continuous speech recognition method, device, equipment and storage medium | |
CN113051362B (en) | Data query method, device and server | |
US20200265843A1 (en) | Speech broadcast method, device and terminal | |
CN110162780A (en) | User intent recognition method and device | |
CN105893351B (en) | Speech recognition method and device | |
CN109741735A (en) | Modeling method, and acoustic model acquisition method and device | |
US20220261545A1 (en) | Systems and methods for producing a semantic representation of a document | |
CN110297893A (en) | Natural language question answering method and apparatus, computer device and storage medium | |
CN108804525A (en) | Intelligent question answering method and device | |
CN108763202A (en) | Method, apparatus, device and readable storage medium for identifying sensitive text | |
CN109410934A (en) | Multi-speaker voice separation method, system and intelligent terminal based on voiceprint features | |
CN109739968A (en) | Data processing method and device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||