CN106373569A

CN106373569A - Voice interaction apparatus and method

Info

Publication number: CN106373569A
Application number: CN201610806384.5A
Authority: CN
Inventors: 曹立新
Original assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Current assignee: Beijing Horizon Robotics Technology Research and Development Co Ltd
Priority date: 2016-09-06
Filing date: 2016-09-06
Publication date: 2017-02-01
Anticipated expiration: 2036-09-06
Also published as: CN106373569B

Abstract

The invention relates to a voice interaction apparatus and method. In one embodiment of the invention, the voice interaction method can comprise the following steps: receiving first voice input from a human user and first expression image input associated with the first voice input; identifying a first meaning of the first voice input; identifying a first expression of the first expression image input; based on the first meaning and the first expression, determining first confidence associated with the first meaning; and based on the first meaning and the first confidence, generating first response information. According to the invention, the response information is generated by use of the expression and the meaning, and the method can improve the experience of the human user in a man-machine voice interaction process.

Description

Voice interaction device and method

Technical field

Present invention relates generally to field of human-computer interaction, more specifically it relates to a kind of voice interaction device and method, its energy Enough improve the accuracy of speech recognition, generate more appropriate voice answer-back, thus realizing man-machine friendship that is more intelligent and more personalizing Mutually process.

Background technology

Language is interpersonal most convenient, maximally effective communication mode, therefore it is readily conceivable that by voice communications applications To field of human-computer interaction, replace the man-machine interaction mode of traditional such as keyboard, mouse etc.Man-machine natural language dialogue meaning The spoken language that machine " can understand " mankind, here it is speech recognition technology.

Language is an art that experienced thousands of years differentiation, and it comprises the abundant information in remote superwood face, and uses language The mankind be the intelligent beings with multiple emotions again, therefore, the communication of interpersonal simple and fast is for machine Speech is probably high complexity.Although having been proposed for many technology at present to improve the accuracy of speech recognition, however, these Prior art is substantially a kind of pattern matching process, i.e. the pattern of identification received speech, by its reference with known voice Pattern is compared one by one, to determine recognition result.In these prior arts, the information that voice is comprised and and voice The utilization of related information is still less, leads to speech recognition technology sometimes can not efficiently identify the true meaning of human user Think to represent.For example, it is possible that irony, the words said in a fit of rage, not knowing where the tone, these are all for interpersonal speech exchange Through the identification ability beyond existing voice technology of identification.Existing voice technology of identification can only be carried out in the form of a kind of " mechanical " Interactive voice process, hinders machinery equipment to more intelligent and direction of more personalizing development.

Accordingly, it would be desirable to a kind of improved man-machine language interactive device and method, it enables machinery equipment more accurately Understand the true intention of human user such that it is able to improve the wisdom degree of machinery equipment and the level that personalizes, more efficient topotype Intend interpersonal communication process, and improve the interactive experience of human user.

Content of the invention

One aspect of the present invention is in man machine language's interaction, by using more information, to make machine Equipment can more accurately understand the true intention of human user.

The present invention one exemplary embodiment provides a kind of voice interactive method, and it mays include: and receives from human user First phonetic entry and the first facial expression image input being associated with described first phonetic entry；Identify described first phonetic entry First semantic；Identify the first expression of described first facial expression image input；Based on the described first semantic and described first expression Determine the first confidence level being associated with the described first semanteme；And it was based on for the described first semantic and described first confidence level next life Become the first response message.

In one example, determine that the first confidence level being associated with the described first semanteme is mayd include: semantic for described first Distribution one acquiescence confidence level；And express one's feelings to adjust described acquiescence confidence level based on described first.

In one example, determine that the first confidence level being associated with the described first semanteme be may also include that based on interactive voice Context adjusting described acquiescence confidence level.

In one example, express one's feelings based on described first and express one's feelings when described first to adjust described acquiescence confidence level and to may include: When being certainty expression, increase described acquiescence confidence level；When the described first expression is negativity expression, reduces described acquiescence and put Reliability；And when the described first expression is in addition to the neutral expression outside described certainty expression is expressed one's feelings with described negativity, Maintain described acquiescence confidence level constant.

In one example, described certainty expression may include happy, pleasantly surprised, anxious, serious, and described negativity expression can be wrapped Include indignation, detest, disdain, frightened, grieved, hesitate, surprised, suspect.

In one example, determine that the first confidence level being associated with the described first semanteme be may also include that and judge described first Whether containing key word of being in a bad mood in semanteme；If not containing key word of being in a bad mood in described first semanteme, execution is described to be based on institute State the step to adjust described acquiescence confidence level for first expression；If containing, in described first semanteme, key word of being in a bad mood, judge Whether described emotion key word is mated with the described first expression；If described emotion key word is matched with the described first expression, Then increase described acquiescence confidence level；And if described emotion key word is mismatched with described first expression, then execute described base In the described first step expressed one's feelings to adjust described acquiescence confidence level.

In one example, determine that the first confidence level being associated with the described first semanteme be may also include that and judge described first Semantic semantic type；If the semantic type of described first semanteme is query, increase described acquiescence confidence level；And if The semantic type of described first semanteme is to state or require, then adjust described acquiescence based on described first expression described in execution and put The step of reliability.

In one example, determine that the first confidence level being associated with the described first semanteme be may also include that and judge described first Semantic semantic type；If the semantic type of described first semanteme is query, increase described acquiescence confidence level；And if Whether the semantic type of described first semanteme is statement or requires, then judge in described first semanteme containing key word of being in a bad mood；As Key word of being in a bad mood is not contained, then execution is described expresses one's feelings to adjust described acquiescence confidence based on described first in really described first semanteme The step of degree；If containing, in described first semanteme, key word of being in a bad mood, judge described emotion key word and described first expression Whether mate；If described emotion key word is matched with the described first expression, increase described acquiescence confidence level；And if Described emotion key word is expressed one's feelings with described first and is mismatched, then adjust described acquiescence based on described first expression described in execution and put The step of reliability.

In one example, generate the first response message based on the described first semantic and described first confidence level to may include: When described first confidence level is more than predetermined threshold, then generate the of the content including being directly associated with described first semanteme One response message；When described first confidence level is less than described predetermined threshold, then generate request described human user confirmation described First response message of the first semanteme.

In one example, also may be used when described first confidence level is less than the first response message being generated during described predetermined threshold Including the content with the described first semantic indirect correlation connection.

In one example, generate the first response message based on the described first semantic and described first confidence level to may include: When described first confidence level is more than predetermined threshold, then generate the of the content including being directly associated with described first semanteme One response message；When described first confidence level is less than described predetermined threshold, then by described first confidence level and the second confidence level Compare, described second confidence level is the just phonetic entry before described first phonetic entry with described human user Associated confidence level；If described first confidence level is more than described second confidence level, generates and ask described human user Confirm the first response message of described first semanteme；And if described first confidence level is less than described second confidence level, then give birth to The described human user of request is become to confirm that described first is semantic and include with the content of the described first semantic indirect correlation connection the One response message.

In one example, methods described may also include and responds described first according to the described first corresponding tone of expression Information synthesizes voice to play to described human user.

Another exemplary embodiment of the present invention provides a kind of voice interaction device, and it mays include: sound identification module, configuration It is derived from the first semanteme of the first phonetic entry of human user for identification；Picture recognition module, is configured to identification and is derived from described people First expression of the first facial expression image input being associated with described first phonetic entry of class user；Confidence level unit, configuration It is based on described first the first confidence level that semantic and described first expression to determine and the described first semanteme is associated；And ring Answer signal generating unit, be configured to the described first semantic and described first confidence level and generate the first response message.

In one example, described confidence level unit may be configured to execute following steps to determine and the described first semanteme The first associated confidence level: for the described first semantic distribution one acquiescence confidence level；And express one's feelings to adjust based on described first Described acquiescence confidence level.

In one example, described confidence level unit may be additionally configured to determine and described first language by executing following steps The first associated confidence level of justice: the context based on interactive voice is adjusting described acquiescence confidence level.

In one example, described confidence level unit may be configured to execute following steps to adjust based on the described first expression Whole described acquiescence confidence level: when the described first expression is certainty expression, increase described acquiescence confidence level；When described first table When feelings are negativity expressions, reduce described acquiescence confidence level；And when described first expression be in addition to described certainty expression and During neutral expression outside described negativity expression, maintain described acquiescence confidence level constant.

In one example, described confidence level unit may be additionally configured to determine and described first language by executing following steps The first associated confidence level of justice: whether judge in described first semanteme containing key word of being in a bad mood；If in described first semanteme Without key word of being in a bad mood, then execute the described step that described acquiescence confidence level is adjusted based on the described first expression；If institute State in the first semanteme and contain key word of being in a bad mood, then judge whether described emotion key word is mated with the described first expression；If institute State emotion key word to match with the described first expression, then increase described acquiescence confidence level；And if described emotion key word Mismatch with the described first expression, then execute the described step that described acquiescence confidence level is adjusted based on the described first expression.

In one example, described confidence level unit may be additionally configured to determine and described first language by executing following steps The first associated confidence level of justice: judge the semantic type of described first semanteme；If the semantic type of described first semanteme is Query, then increase described acquiescence confidence level；And if the semantic type of described first semanteme is statement or requires, then execute institute State based on the described first step expressed one's feelings to adjust described acquiescence confidence level.

In one example, described response generation module may be configured to execute following steps to generate described first response Information: when described first confidence level is more than predetermined threshold, then generate in including being directly associated with described first semanteme The first response message held；When described first confidence level is less than described predetermined threshold, then generates and ask described human user true Recognize the first response message of described first semanteme.

In one example, when described first confidence level is less than described predetermined threshold, described response generation module generates First response message may also include the content with the described first semantic indirect correlation connection.

In one example, described response generation module may be configured to execute following steps to generate described first response Information: when described first confidence level is more than predetermined threshold, then generate in including being directly associated with described first semanteme The first response message held；When described first confidence level is less than described predetermined threshold, then by described first confidence level and second Confidence level compares, and described second confidence level is the just language before described first phonetic entry with described human user The associated confidence level of sound input；If described first confidence level is more than described second confidence level, generates and ask described people Class user confirms the first response message of described first semanteme；And if described first confidence level is less than described second confidence Degree, then generate the described human user of request and confirm that described first is semantic and include and the described first semantic indirect correlation connection First response message of content.

In one example, described device may also include that voice synthetic module, is configured to according to corresponding with the described first expression The tone described first response message is synthesized voice to play to described human user.

Another exemplary embodiment of the present invention provides a kind of electronic equipment, and it mays include: voice receiving unit；Image-receptive Unit；Memorizer；And processor, by bus system and described voice receiving unit, described image receiving unit and described Memorizer is connected to each other, and described processor is configured to run the instruction being stored on described memorizer to execute previously described side Any one in method.

Another exemplary embodiment of the present invention provides a kind of computer program, and it may include computer program instructions, Described computer program instructions can make any in the previously described method of described computing device when being run by processor A kind of.

Another exemplary embodiment of the present invention provides a kind of computer-readable recording medium, and can be stored with computer journey thereon Sequence instructs, and described computer program instructions can make when being run by processor in the previously described method of described computing device Any one.

Brief description

By combining accompanying drawing, the embodiment of the present application is described in more detail, the above-mentioned and other purposes of the application, Feature and advantage will be apparent from.Accompanying drawing is used for providing the embodiment of the present application is further understood, and constitutes explanation A part for book, is used for explaining the application together with the embodiment of the present application, does not constitute the restriction to the application.In the accompanying drawings, Identical reference number typically represents same parts or step.

Fig. 1 is the flow chart illustrating the voice interactive method according to the present invention one exemplary embodiment.

Fig. 2 is to illustrate to determine the process of confidence level according to the present invention one exemplary embodiment based on semantic and expression Flow chart.

Fig. 3 is to illustrate to determine the process of confidence level according to another exemplary embodiment of the present invention based on semantic and expression Flow chart.

Fig. 4 is to illustrate to determine the process of confidence level according to another exemplary embodiment of the present invention based on semantic and expression Flow chart.

Fig. 5 is the mistake illustrating to generate based on semantic and confidence level response message according to the present invention one exemplary embodiment The flow chart of journey.

Fig. 6 is the block diagram illustrating the voice interaction device according to the present invention one exemplary embodiment.

Fig. 7 is the block diagram illustrating the electronic equipment according to the present invention one exemplary embodiment.

Specific embodiment

Below, example embodiment according to the application will be described in detail by referring to the drawings.Obviously, described embodiment is only Only a part of embodiment of the application, rather than the application whole embodiments it should be appreciated that the application be not subject to described herein The restriction of example embodiment.

Fig. 1 illustrates the general block diagram of the man machine language's exchange method 100 according to the present invention one exemplary embodiment.Here, " people " can represent human user, and " machine " can represent any type of electronic equipment with human-computer interaction function, including but It is not limited to mobile electronic device, such as smart mobile phone, flat board, notebook, robot, personal digital assistant, vehicle mounted electric Sub- equipment, and non-mobile electronic equipment, such as desktop computer, information service terminal, ticket terminals, intelligent appliance equipment, Intelligent customer service equipment etc..These equipment may be by voice interaction device described herein and method.It should also be understood that Voice interaction device described herein and method are also applied to the following electronic equipment with voice interactive function developed In.

With reference to Fig. 1, voice interactive method 100 may begin at step s110 and s112, wherein in step s110, executes voice The electronic equipment of interaction can receive the first phonetic entry from human user, then in step s112, execution interactive voice Electronic equipment can receive the first facial expression image input being associated from human user with this first phonetic entry.For example, Electronic equipment can catch, using mike or microphone array, the voice that human user sends, and simultaneously utilize photographic head Catch the facial expression image of human user.In most of the cases, human user, when carrying out man-machine interaction, is normally at electronics and sets Standby dead ahead, therefore electronic equipment will be captured human face expression acquiescence positioned at dead ahead as carrying out voice friendship The expression of mutual user.In further embodiments, electronic equipment can carry out the people of interactive voice with detect and track Class user.For example, electronic equipment can carry out interactive voice using microphone array by sound localization technology for detection The orientation of human user, then rotating camera to be to be directed at this orientation, thus obtaining the facial expression image of this human user.Sound source is fixed Position for known to the skilled artisan, its ultimate principle is not detailed herein.Using sound localization come detect and track The technical scheme of user be also described in the applicant Chinese invention patent application 201610341566.x and 201610596000.1 in, the disclosure of which is incorporated herein by reference.

It is understood that the video that the audio signal that captures of mike or microphone array and photographic head capture Or picture signal can be pretreated and carry timestamp.So, electronic equipment can be based on the time by phonetic entry (sound Frequency signal) and facial expression image input (video or picture signal) be associated.For example, when electronic equipment has detected phonetic entry When, the facial expression image input same or about with its time can be extracted.

Next, in step s114, speech recognition can be carried out to determine it to the first received phonetic entry First is semantic.Here, the first semanteme can be the literal semanteme of the first phonetic entry, i.e. textual representation, and it has been able to utilize Existing various speech recognition technology and identified with very high accuracy rate.For example, say when human user and " order a tomorrow Remove the air ticket in Shanghai " when, " ordering an air ticket going to Shanghai tomorrow " this illustration and text juxtaposed setting can be identified by speech recognition technology originally, Semantic as first.

Additionally, in step s116, image recognition can be carried out to determine to the first received facial expression image input First expression of human user.For example, it is possible to identify the first of user express one's feelings for happy, anxious, hesitate etc., or user First expression can be neutral expression, that is, poker-faced.

Here it should be understood that in step s114 and s116, the present invention can use any existing speech recognition skill Art and image recognition technology.For example, spendable speech recognition technology may include method based on channel model and phonic knowledge, Method for mode matching and Artificial Neural Network etc., wherein method for mode matching have been obtained for more and more deep grinding Study carefully, it includes such as dynamic time warping (dtw), hidden Markov method (hmm), vector quantization method (vq) etc..Artificial neuron Network method is research popular in recent years, and it typically can be used together in conjunction with existing method for mode matching.Spendable Image recognition technology can be the technology being exclusively used in expression recognition, generally can be divided into following three types: Global estimation drawn game Portion's method of identification；Deformation extraction method and extraction method；Geometrical measurers and total content control method.Identified with conventional entirety and local As a example method, Global estimation may include PCA (the prineipal component of such as feature based face Analysis), Independent component analysis (indendent component analysis), fisher linear discriminant analysis (fisher ' s linear discriminants), Local Features Analysis method (loeal feature analysis), fisher The dynamic method (fisher actions) of birth, HMM method (hmm) and clustering methodology；Local estimation may include face Motion encoded analytic process (facial actions coding system), facial movement parametric method, local principal component analysis method (local pca), gabor wavelet method and neural network etc..It will also be appreciated that being not limited to example given here, this Bright can also be using the speech recognition technology of other and following exploitation and face recognition technology.

Next in step s118, can be semantic semantic with described first with the first expression determination based on first being identified The first associated confidence level.In the present invention, confidence level can be defined as indicating whether described first semanteme is that the mankind use The amount of the true intention at family, for example, it can be a numerical range, and value is bigger, then show more to determine that this first semanteme is The true intention of user, value is lower, then show that this first semanteme more uncertain is the true intention that user is intended by, for example User is also less satisfied to the meaning of this phonetic representation, irresolute in other words.

The target of conventional speech recognition is only that accurately, makes every effort to identify the language literary composition that human user is said exactly Word, therefore this speech recognition process are " mechanical ", so lead to interactive process be also machinery, be totally different from people with Exchange between people.The meaning on spoken and written languages surface when carrying out communication between people, can not only be identified, also can watch the mood and guess the thoughts, Judge mood or the attitude of other side by observing the expression of other side, thus judging whether the language of other side is its true intention table Show.The general principles of the present invention are that in man-machine interaction, judge that speech recognition is tied by the expression identifying human user Whether fruit is the true intention of human user, thus realizing the interactive process more like interpersonal exchange.

Specifically, in step s118, can be the first semantic distribution one acquiescence confidence level first.For example, confidence level Span can be 1 to 10, wherein 10 high one end of expression confidence levels, 1 represents the low one end of confidence level, gives tacit consent to confidence level The middle part of this span, such as 4-6 can be arranged on.In one example, acquiescence confidence level could be arranged to such as 5.

It is then possible to express one's feelings to adjust distributed acquiescence confidence level according to first being identified.Expression can substantially be divided For three classes, i.e. certainty expression, negativity expression and neutral expression.Certainty shows that user's word confidence level is high, is Its true intention represents.For example, when user shows happiness or happiness, pleasantly surprised expression it is believed that confidence level is high.Additionally, When user shows anxiety, serious expression it is also possible to think that the confidence level that it is spoken is high.Therefore, when the first table being identified When feelings are expressed one's feelings for these, acquiescence confidence level can be increased.On the other hand, when user shows indignation, detests, disdains or scorn, fears Fear, grieved, hesitate, surprised, suspect etc. during negativity expression it is believed that its confidence level of speaking is low, therefore reduce and distributed Acquiescence confidence level.For example, when user says " ordering an air ticket going to Shanghai tomorrow " with glad or serious expression, Yong Huke This intention can very be determined, the true intention therefore " ordering an air ticket going to Shanghai tomorrow " and be exactly user represents；And work as user When " ordering an air ticket going to Shanghai tomorrow " is said with hesitation, sad or dejected, angry expression, then it is likely to user also not Determine whether to the Shanghai that go by aeroplane tomorrow, user's tomorrow is gone by aeroplane this schedule of Shanghai is dissatisfied in other words, Therefore " ordering an air ticket going to Shanghai tomorrow " may not be that the true intention that user wants represents, so institute now should be reduced The acquiescence confidence value of distribution.And work as user be neutral expression, for example no special expression when, the acquiescence that can remain distributed is put Certainty value is constant.

It should be understood that the principle of the present invention is not limited to the specific example of these expressions given here, but can also use More expressions are it might even be possible to using different expression classification rules, a particular emotion will be categorized as certainty expression, negative Property expression or neutral expression.

In certain embodiments, each certainty and negativity expression can be divided into different degree or rank again.For example, For happiness or happiness, smile and can represent the happiness of lower degree, grin and laugh at the happiness that can represent moderate, open one's mouth big Laugh at the happiness representing higher degree.Degree according to every kind of expression or rank, the adjustment to acquiescence confidence value can also be different. For example, confidence value can be raised 1 by the certainty expression of lower degree, and moderate certainty expression can be by confidence level Value rise 2, confidence value can be raised 3 by the certainty expression of higher degree.Certainly it will be appreciated that neutral expression can not Divide degree or rank.

In certain embodiments, it is also based on the context of interactive voice to adjust distributed acquiescence confidence level.Example As when the weather that interactive voice content before shows Shanghai tomorrow is heavy rain, then the voice of user " is ordered a tomorrow to go The confidence level of the air ticket in sea " is just relatively low；On the other hand, if the schedule of interactive voice before or user shows that user is bright When there is meeting schedule in Shanghai, then the confidence level of the voice " ordering an air ticket going to Shanghai tomorrow " of user is just relatively low for it.Therefore, Distributed acquiescence confidence value can based on context be adjusted, thus realizing more intelligent confidence value determination process.

After determining the first confidence level of the first semanteme in step s118, in step s120, can be according to first Semanteme to generate the first response message with the first related confidence level.In conventional interactive process, it is all according to first Semanteme is generating response message；And in the present invention, due to be realised that first the first semantic related confidence level, that is, be realised that the One semanteme is the degree that user's true intention represents, therefore can generate different response messages based on this information.At some In embodiment, when determined by the first confidence level high when, such as more than a predetermined threshold, then be based on first standard generate response Information, such as regular speech interaction, generates the information being directly associated with the first semanteme.It is appreciated that it is " directly related Connection " means that this information is based on the information that the first semantic user determining may directly want.For example, if the first language Justice for " ordering an air ticket going to Shanghai tomorrow " and the first confidence level more than predetermined threshold, then ticket letter more than electronic equipment inquiry Breath, without remaining ticket, then generates " not having remove the air ticket in Shanghai remaining tomorrow " such response message；If also had Remaining ticket, then generate " a airline, b airline ... ticket of also having a surplus, airline please be select " such response message.Separately On the one hand, when determined by the first confidence level low when, such as less than predetermined threshold, then show that the first semanteme is most likely not user Declaration of will that truly want or satisfied, now electronic equipment can be based on the second standard next life different from the first standard Become response message, for example, can generate the response message requiring user to confirm the first semanteme.For example, if the first semanteme is for " ordering One tomorrow removes the air ticket in Shanghai " and the first confidence level is less than predetermined threshold, then electronic equipment can generate and " determine that you will order one Remove the air ticket in Shanghai tomorrow " such response message, with the chance thinking over again to user.Additionally, determined by working as When first confidence level is low, electronic equipment can also generate the content for example with the first semantic indirect correlation connection.It is appreciated that " Connect associated " mean that this information may not be the information directly wanted based on the first semantic user determining, but and user Information correlation or related to the first semanteme information directly wanted.For example, when the first semanteme is for " ordering a tomorrow to go The air ticket in Shanghai " and the first confidence level are less than predetermined threshold, then electronic equipment can generate and " determine that you will order a tomorrow and go to Shanghai Air ticket？The weather in Shanghai will have heavy rain tomorrow " or " determine that you will order an air ticket going to Shanghai tomorrow？Tomorrow you There is meeting schedule in Beijing " such response message, so that user considers that correlative factor to determine whether the first semanteme is that it is true It is intended to.

Then, in step s122, by phonetic synthesis (tts) technology, the generated first response message can be synthesized For voice, play to human user will pass through speaker and/or display, thus completing the interactive voice mistake of one bout Journey.Equally, the present invention can be repeated no more here using the speech synthesis technique of any existing or following exploitation.

In certain embodiments, according to the first corresponding tone of expression, the first response message can be synthesized language Sound.For example, when first as user is expressed one's feelings for glad or happy, excited expression, step s122 can also be using glad language Gas is synthesizing voice；When user is sad, dejected, frightened expression, step s122 can be synthesized using the tone of comfort Voice；When user be indignation, the angry, expression detesting, disdain when, step s122 then can be synthesized using the tone in a timid manner Voice.So, the voice response playing to user can be easier to be easily accepted by a user, and contributes to improving the mood of user, Improve the interactive experience of user.Certainly, the corresponding relation between the tone of synthesis voice and expression is not limited to given here showing Example, but differently can be defined according to application scenarios.

In the conventional phonetic synthesis with emotion, generally require the semanteme to text and be analyzed, determined by machine and close Become the emotion needed for voice or the tone.And in the present invention, can directly using the first expression being identified, and adopt corresponding The tone or emotion, to synthesize voice, are analyzed to determine the process of the tone such that it is able to omit to text, simpler in program, And the synthesized voice tone can more accurately meet the current mood of user or emotion, so that interactive process is richer There is genuine human interest, and avoid frosty mechanical sense.

Describe some exemplary embodiments of the present invention above with reference to Fig. 1, it is applied to many general speech exchanges Scene.But, interpersonal speech exchange is complicated, is likely encountered various special circumstances.To retouch below in conjunction with the accompanying drawings State man machine language's exchange method that some can tackle similar special screne.

Fig. 2 illustrates according to another exemplary embodiment of the present invention based on first semantic and first expression determination the first confidence The flow chart of the process 200 of degree.In step s118 above in relation to Fig. 1 description, by being distributed based on the first expression adjustment Acquiescence confidence level determining the first confidence level.Specifically, when the first expression is certainty expression, increase acquiescence confidence Degree；When the first expression is negativity expression, reduce acquiescence confidence level；When the first expression is neutral expression, then maintain acquiescence Confidence level is constant.However, it is contemplated that the complexity of speech exchange, this adjustment mode there may be defect in some cases. For example, when human user states some sad things with very sad expression, or stated with very terrified expression During very terrified thing, typically can determine that the confidence level of its language is higher, and should not reduce its confidence level.Therefore, in Fig. 2 institute In the embodiment shown, whether retrieve in step s210 first in the first semanteme containing key word of being in a bad mood.Emotion key word refers to The vocabulary can be associated with specific expression or emotion, such as disaster, accident etc. be associated with sad, frightened etc., travel, purchase Thing etc. is associated with happiness, etc..If not retrieving emotion key word in step s210, execute in step s212 The previously described step to adjust distributed acquiescence confidence level based on the first expression.If retrieving feelings in step s210 Thread key word, then judge in step s214 whether retrieved emotion key word is mated with the first expression.In some enforcements In example, step s210 may retrieve multiple emotion key words, then can be by each key word and the in step s214 One expression compares, as long as there being an emotion key word to match with the first expression, then judged result is coupling；Only institute is in love When thread key word and the first expression all mismatch, then judged result is to mismatch.

If the judged result in step s214 is to mismatch, step s216 can execute previously described being based on First step expressed one's feelings to adjust distributed acquiescence confidence level；If the judged result in step s214 is coupling, this shows The expression of human user and its voice content match, then it is considered that the confidence level of the first semanteme is very high, then in step Can directly increase distributed acquiescence confidence level in s218, and can using the confidence level after increasing as with the first semantic phase First confidence level output of association, for the operation as described by step s120 below.

Described above is and judge, from the content of the first semanteme, the situation that its whether with the first expression matches.In other feelings It is also contemplated that the type of the first semanteme is carrying out interactive voice in condition.Fig. 3 shows base according to another embodiment of the present invention Flow chart in the first semantic process 300 determining the first confidence level with the first expression.As shown in figure 3, in step s310, first First may determine that the semantic type of the first semanteme.For from linguisticss, semantic type is generally divided into three kinds, statement, query and Require, i.e. assertive sentence, interrogative sentence and imperative sentence, different semantic types corresponds generally to different confidence levels.For example, work as user When saying interrogative sentence, typically show that it wonders certain answer, therefore confidence level is typically higher；And work as user and saying When assertive sentence and imperative sentence, typically it is difficult to judge confidence level based on semantic type.

Therefore, if judging that in step s310 the semantic type of the first semanteme is query, can be in step s312 Directly increase distributed acquiescence confidence level, and can be using the confidence level after increasing as first being associated with the first semanteme Confidence level exports, for the operation as described by step s120 below.On the other hand, if judging in step s310 The semantic type of one semanteme is statement or requires, or perhaps other semantic types in addition to query, then can be in step The foregoing step to adjust distributed acquiescence confidence level based on the first expression is executed in s314.

Fig. 4 shows a case that to consider emotion key word described above and two kinds of factors 400 of semantic type.With reference to Fig. 4, The semantic type of the first semanteme is may determine that first in step s410.If the semantic type of the first semanteme is query, Increase distributed acquiescence confidence level in step s412, and the confidence level after increasing can be associated as with the first semanteme First confidence level output, for the operation as described by step s120 below.If the semantic type of the first semanteme is Statement or requirement, or perhaps other semantic types in addition to query, then can be carried out to step s414.

In step s414, can continue whether to judge in the first semanteme containing key word of being in a bad mood.If in the first semanteme Without key word of being in a bad mood, then execute the previously described step to adjust acquiescence confidence level based on the first expression in step s416 Suddenly；If containing, in the first semanteme, key word of being in a bad mood, continue to judge described emotion key word and the first table in step s418 Whether feelings mate.If it does, then directly increasing distributed acquiescence confidence level in step s420, and after can increasing Confidence level as with first semanteme be associated first confidence level export, for the behaviour as described by step s120 below Make；If it does not match, executing the previously described step that acquiescence confidence level is adjusted based on the first expression in step s422.

Fig. 5 illustrate based on first being identified semantic and determined by the first confidence level generate the another of the first response message The flow chart of one embodiment 500.First, in step s510 it can be determined that whether the first confidence value is more than predetermined threshold.As Front described, predetermined threshold can be a predetermined confidence standard, when the first confidence value is more than predetermined threshold, then It is considered that confidence level is high；If the first confidence level is less than predetermined threshold, it is considered that confidence level is low.

When the first confidence level is more than predetermined threshold, then can generate in step s512 and include with the first semanteme directly First response message of associated content.When the first confidence level is less than predetermined threshold, then can continue in step s514 Confidence value (for convenience, can be described as the second confidence level here) phase by the first confidence level and previous phonetic entry Relatively.Comparison between first confidence level and the second confidence level before can reflect the feelings of human user during interactive voice Thread changes.For example, if the first confidence level is more than the second confidence level, although showing absolute confidence, still relatively low (first puts Reliability is less than threshold value), but relative confidence increases (the first confidence level is more than the second confidence level), so interaction may Developing toward the good aspect.Now, in step s516, the first response that request human user confirms the first semanteme can be generated Information.On the other hand, if determine in step s514 the first confidence level be less than before the second confidence level, show not only exhausted Low to confidence level, and relative confidence is also declining, and interaction may develop to bad direction.Now, in step The first response message generating in s518 not only can include the content asking human user to confirm the first semanteme, but also permissible Including the content with the first semantic indirect correlation connection so that user considers and selects.

Below, it is described with reference to Figure 6 the voice interaction device according to the present invention one exemplary embodiment.As it was previously stated, this The voice interaction device of invention may apply to any type of electronic equipment with human-computer interaction function, including but not limited to Mobile electronic device, such as smart mobile phone, flat board, notebook, robot, personal digital assistant, vehicle electronic device, And non-mobile electronic equipment, such as desktop computer, information service terminal, ticket terminals, intelligent appliance equipment, intelligent customer service Equipment etc..These equipment may be by voice interaction device described herein and method.It should also be understood that being described herein Voice interaction device be also applied in the electronic equipment with voice interactive function of following exploitation.

As shown in fig. 6, voice interaction device 600 may include sound identification module 610, picture recognition module 620, confidence level Module 630, response generation module 640 and voice synthetic module 650.Sound identification module 610 is configurable to identification and is derived from people The first of first phonetic entry of class user is semantic.It is appreciated that sound identification module 610 can utilize any existing, example As commercially available speech recognition engine, or can also be using the speech recognition engine of following exploitation.Picture recognition module 620 Can be configured to the first table from the first facial expression image input being associated with described first phonetic entry of human user for the identification Feelings.It is also understood that picture recognition module 620 can be using any existing, for example commercially available facial expression image identification Engine, or can also be using the facial expression image identification engine of following exploitation.Confidence level module 630 can be based on speech recognition mould First semantic and picture recognition module 620 identification the first expression of block 610 identification is associated with the described first semanteme to determine The first confidence level.For example, confidence level module 630 can be the first semantic distribution one acquiescence confidence level first, is then based on the One expresses one's feelings to adjust distributed acquiescence confidence level, to obtain the first final confidence level.Specifically, when the first expression is willing During qualitative expression, then increase described acquiescence confidence level；When the first expression is negativity expression, then reduce described acquiescence confidence Degree；Other expressions outside the first expression is in addition to certainty expression and negativity expression, such as during neutral expression, then maintain The acquiescence confidence level being distributed is constant.

In certain embodiments, whether confidence level module 630 can also judge the first semanteme containing key word of being in a bad mood, and The emotion being comprised key word is compared with the first expression.If the emotion key word that the first semanteme comprises and the first expression phase Coupling, then the confidence level that explanation user speaks is high, therefore directly increases distributed acquiescence confidence level.If the first semanteme does not wrap Include emotion key word, emotion key word included in other words and the first expression mismatch, then can execute previously described base Express one's feelings to adjust the operation of distributed acquiescence confidence level in first.

In certain embodiments, confidence level module 630 can also judge the semantic type of the first semanteme.If first is semantic Semantic type be query then it is assumed that the confidence level spoken of user is high, therefore directly increase distributed acquiescence confidence value；As Fruit is other semantic types outside query, for example, state or require, then can execute previously described be based on the first expression Lai The operation of the distributed acquiescence confidence level of adjustment.

In certain embodiments, confidence level module 630 is also based on context to adjust distributed acquiescence confidence level. For example, if first is semantic consistent with the context of interactive voice, its confidence level is high, therefore increases distributed acquiescence confidence Degree；On the contrary, if it is inconsistent, reducing distributed acquiescence confidence level.

With continued reference to Fig. 6, the response generation module 640 of voice interaction device 600 can be using from sound identification module Semantic and from confidence level module 630 the first confidence level of the first of 610 generates the first response message.Response generation module 640 First response message can be generated with different standards according to the first confidence level.In certain embodiments, when the first confidence level When more than predetermined threshold, then first standard that is based on generates the first response message, for example, generate and include directly phase semantic with first First response message of the content of association；When the first confidence level is less than predetermined threshold, then second standard that is based on generates the first sound Answer information, for example, generate the first response message that request human user confirms the first semanteme, or for example generate and also include and the First response message of the content of one semantic indirect correlation connection.

The process generating response message can be related to using knowledge base 660.Knowledge base 660 can be local knowledge base, It can be included as a part for speech recognition equipment 600 it is also possible to as shown in Figure 6, is high in the clouds knowledge base 660, speech recognition equipment 600 passes through the network connection of such as wide area network or LAN etc to high in the clouds knowledge base 660.Knowledge base 660 may include various knowledge datas, such as weather data, flight data, hotel's data, movie data, music data, food and drink number According to, stock certificate data, tourism data, map datum, government bodies' data, domain knowledge, historical knowledge, natural science knowledge, society Meeting scientific knowledge etc..Response generation module 640 can obtain directly or indirectly related to the first semanteme from knowledge base 660 Knowledge, for generating the first response message.

In certain embodiments, when the first confidence level is more than predetermined threshold, response generation module 640 generate include with First response message of the content that the first semanteme is directly associated；When the first confidence level is less than predetermined threshold, then response generates First confidence level is also compared by module 640 with the second confidence level, and described second confidence level is just described with human user The confidence level that a phonetic entry before first phonetic entry is associated.If the first confidence level is more than the second confidence level, Then response generation module 640 can generate the first response message that request human user confirms the first semanteme；If the first confidence Degree is less than the second confidence level, then response generation module 640 can generate request human user and confirms that first is semantic and include and the First response message of the content of one semantic indirect correlation connection.

Then, the first response message that response generation module 640 is generated can be synthesized language by voice synthetic module 650 Sound, to play to human user by speaker (not shown), thus complete the interactive voice process of one bout.Real at some Apply in example, voice synthetic module 650 can also express one's feelings to carry out phonetic synthesis using from the first of picture recognition module 620. Specifically, the first response message can be synthesized language according to the first corresponding tone of expression by voice synthetic module 650 Sound.For example, when first as user is expressed one's feelings for glad or happy, excited expression, voice synthetic module 650 can also be using height The emerging tone is synthesizing voice；When user is sad, dejected, frightened expression, voice synthetic module 650 can be using comfort The tone synthesizing voice；When user be indignation, the angry, expression detesting, disdain when, then voice synthetic module 650 can be adopted Synthesize voice with the tone in a timid manner.So, the voice response playing to user can be easier to be easily accepted by a user, and has Help improve the mood of user, improve the interactive experience of user.Certainly, voice synthetic module 650 can also be according to other tables Corresponding relation between feelings and the tone is carrying out phonetic synthesis, and is not limited to example given here.

Fig. 7 illustrates according to the present invention one exemplary embodiment, available voice interaction device described above and method Electronic equipment block diagram.As shown in fig. 7, electronic equipment 700 may include voice receiving unit 710 and image receiving unit 720. Voice receiving unit 710 can be such as mike or microphone array, and it can catch the voice of user.Image receiving unit 720 can be such as monocular cam, binocular camera or more purpose photographic head, and it can catch the image of user, especially It is face image, and therefore image receiving unit 720 can have face recognition function, with the clear table catching user exactly Feelings image.

As shown in fig. 7, electronic equipment 700 can also include one or more processors 730 and memorizer 740, they lead to Cross bus system 750 to be connected to each other with voice receiving unit 710 and image receiving unit 720.Processor 730 can be centre Reason unit (cpu) or have the processing unit of other forms of data-handling capacity and/or instruction execution capability, process cores, Or controller, and it can be with the miscellaneous part in control electronics 700 to execute desired function.Memorizer 740 is permissible Including one or more computer programs, described computer program can include various forms of computer-readables and deposit Storage media, such as volatile memory and/or nonvolatile memory.Described volatile memory for example can include depositing at random Access to memory (ram) and/or cache memory (cache) etc..Described nonvolatile memory for example can include read-only Memorizer (rom), hard disk, flash memory etc..One or more computer journeys can be stored on described computer-readable recording medium Sequence instructs, and processor 740 can run described program instruction, to realize the interactive voice of embodiments herein mentioned above Method and/or other desired functions.Various application programs can also be stored in described computer-readable recording medium With various data, such as user data, knowledge data base etc..

Additionally, electronic equipment 700 can also include output unit 760.Output unit 760 can be such as speaker, with Carry out interactive voice with user.In further embodiments, output unit 760 can also be the output dress such as display, printer Put.

In addition to said method, device and equipment, present embodiments can also be computer program, its Including computer program instructions, described computer program instructions make this explanation of described computing device when being run by processor The step in the voice interactive method according to each embodiment of the application described in book.

Described computer program can be write for holding with the combination in any of one or more programming language The program code of row the embodiment of the present application operation, described program design language includes object oriented program language, such as Java, c++ etc., also include the procedural programming language of routine, such as " c " language or similar programming language.Journey Sequence code fully can execute on an electronic device, partly execute on an electronic device, as an independent software kit Execute, partly on consumer electronic devices, part executes or on a remote computing completely in remote computing device or clothes Execute on business device.

Additionally, embodiments herein can also be computer-readable recording medium, it is stored thereon with computer program and refers to Order, described computer program instructions make when being run by processor the description of described computing device this specification according to this Shen Please step in the voice interactive method of various embodiments.

Described computer-readable recording medium can adopt the combination in any of one or more computer-readable recording mediums.Computer-readable recording medium can To be readable signal medium or readable storage medium storing program for executing.Readable storage medium storing program for executing for example can include but is not limited to electricity, magnetic, light, electricity The system of magnetic, infrared ray or quasiconductor, device or device, or arbitrarily above combination.Readable storage medium storing program for executing is more specifically Example (non exhaustive list) includes: has the electrical connection of one or more wires, portable disc, hard disk, random access memory Device (ram), read only memory (rom), erasable programmable read only memory (eprom or flash memory), optical fiber, portable compact disc Read only memory (cd-rom), light storage device, magnetic memory device or above-mentioned any appropriate combination.

Describe the ultimate principle of the application above in association with specific embodiment, however, it is desirable to it is noted that in this application The advantage that refers to, advantage, effect etc. are only exemplary rather than limiting it is impossible to think that these advantages, advantage, effect etc. are the application Each embodiment is prerequisite.In addition, detail disclosed above is merely to the effect of example and the work readily appreciating With, and unrestricted, it is must to be realized using above-mentioned concrete details that above-mentioned details is not intended to limit the application.

The device that is related in the application, device, equipment, the block diagram of system are only used as exemplary example and are not intended to Require or hint must be attached, arrange, configure according to the mode that square frame illustrates.As it would be recognized by those skilled in the art that , can be connected, be arranged by any-mode, configure these devices, device, equipment, system.Such as " include ", "comprising", " tool Have " etc. word be open vocabulary, refer to " including but not limited to ", and can be with its used interchangeably.Vocabulary used herein above "or" and " and " refer to vocabulary "and/or", and can be with its used interchangeably, unless it is not such that context is explicitly indicated.Here made Vocabulary " such as " refers to phrase " such as, but not limited to ", and can be with its used interchangeably.

It may also be noted that in the equipment and method of the application, each part or each step are can to decompose and/or weight Combination nova.These decompose and/or reconfigure the equivalents that should be regarded as the application.

There is provided the above description of disclosed aspect so that any person skilled in the art can make or using this Application.Various modifications to these aspects are readily apparent to those skilled in the art, and here definition General Principle can apply to other aspects without deviating from scope of the present application.Therefore, the application is not intended to be limited to Aspect shown in this, but according to the widest range consistent with principle disclosed herein and novel feature.

In order to purpose of illustration and description has been presented for above description.Additionally, this description is not intended to the reality of the application Apply example and be restricted to form disclosed herein.Although already discussed above multiple exemplary aspect and embodiment, this area skill Art personnel will be recognized that its some modification, modification, change, interpolation and sub-portfolio.

Claims

1. a kind of voice interactive method, comprising:

Receive the first facial expression image be associated from the first phonetic entry of human user and with described first phonetic entry defeated Enter；

Identify the first semanteme of described first phonetic entry；

Identify the first expression of described first facial expression image input；

The first confidence level being associated with the described first semanteme is determined based on the described first semantic and described first expression；And

First response message is generated based on the described first semantic and described first confidence level.

2. the method for claim 1, wherein determine that the first confidence level being associated with the described first semanteme is included:

For the described first semantic distribution one acquiescence confidence level；And

Express one's feelings to adjust described acquiescence confidence level based on described first, comprising:

When the described first expression is certainty expression, increase described acquiescence confidence level；

When the described first expression is negativity expression, reduce described acquiescence confidence level；And

When the described first expression is in addition to the neutral expression outside described certainty expression is expressed one's feelings with described negativity, maintain institute State acquiescence confidence level constant.

3. the method for claim 1, wherein determine that the first confidence level being associated with the described first semanteme is also included:

Whether judge in described first semanteme containing key word of being in a bad mood；

If not containing key word of being in a bad mood in described first semanteme, express one's feelings based on described first described in execution described silent to adjust The step recognizing confidence level；

If contain in described first semanteme be in a bad mood key word, judge described emotion key word and described first expression whether Join；

If described emotion key word is matched with the described first expression, increase described acquiescence confidence level；And

If described emotion key word is mismatched with described first expression, based on described first expression to adjust described in execution The step stating acquiescence confidence level.

4. the method for claim 1, determines that the first confidence level being associated with the described first semanteme is also included:

Judge the semantic type of described first semanteme；

If the semantic type of described first semanteme is query, increase described acquiescence confidence level；And

If the semantic type of described first semanteme is statement or requires, execution is described to be based on described first expression to adjust The step stating acquiescence confidence level.

5. the method for claim 1, wherein it is based on the described first semantic and described first confidence level and generate the first sound Information is answered to include:

When described first confidence level is more than predetermined threshold, then generate the content including being directly associated with described first semanteme The first response message；

When described first confidence level is less than described predetermined threshold, then generates the described human user of request and confirm that described first is semantic The first response message.

6. method as claimed in claim 5, wherein, when described first confidence level is less than the being generated during described predetermined threshold One response message also includes the content with the described first semantic indirect correlation connection.

7. the method for claim 1, wherein it is based on the described first semantic and described first confidence level and generate the first sound Information is answered to include:

When described first confidence level is less than described predetermined threshold, then described first confidence level is compared with the second confidence level, Described second confidence level is to be associated with a just phonetic entry before described first phonetic entry of described human user Confidence level；

If described first confidence level is more than described second confidence level, generates the described human user of request and confirm described first The first semantic response message；And

If described first confidence level is less than described second confidence level, generates the described human user of request and confirm described first language Justice and include the first response message with the content of the described first semantic indirect correlation connection.

8. the method for claim 1, also includes responding described first according to the described first corresponding tone of expression Information synthesizes voice to play to described human user.

9. a kind of voice interaction device, comprising:

Sound identification module, is configured to identify that first of the first phonetic entry from human user is semantic；

Picture recognition module, is configured to identify the first table being associated with described first phonetic entry from described human user First expression of feelings image input；

Confidence level unit, is configured to the described first semantic and described first expression and is associated with the described first semanteme to determine The first confidence level；And

Response signal generating unit, is configured to the described first semantic and described first confidence level and generates the first response message.

10. device as claimed in claim 9, wherein, described confidence level unit is configured to pass execution following steps to determine The first confidence level being associated with the described first semanteme:

11. devices as claimed in claim 10, wherein, described confidence level unit is additionally configured to by executing following steps Lai really Fixed the first confidence level being associated with the described first semanteme:

Judge the semantic type of described first semanteme；

If the semantic type of described first semanteme is query, increase described acquiescence confidence level；

If the semantic type of described first semanteme is statement or requires, whether judge in described first semanteme containing pass of being in a bad mood Keyword；

12. devices as claimed in claim 9, wherein, described response generation module is configured to pass execution following steps next life Become described first response message:

13. devices as claimed in claim 12, wherein, when described first confidence level is less than described predetermined threshold, described sound The first response message that generation module generates is answered also to include the content with the described first semantic indirect correlation connection.

14. a kind of electronic equipments, comprising:

Voice receiving unit；

Image receiving unit；

Memorizer；And

Processor, is connected with described voice receiving unit, described image receiving unit and described memorizer each other by bus system Connect, described processor is configured to run the instruction being stored on described memorizer with any one of perform claim requirement 1-8 institute The method stated.

A kind of 15. computer programs, including computer program instructions, described computer program instructions are being run by processor When make method according to any one of claim 1-8 for the described computing device.