CN106373569A - Voice interaction apparatus and method - Google Patents
- Publication number: CN106373569A (application CN201610806384.5A)
- Authority: CN (China)
- Prior art keywords: confidence level, expression, semantics, default
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/225: Feedback of the input speech
- G10L15/08: Speech classification or search
- G10L15/26: Speech to text systems
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
- G06V40/168: Feature extraction; face representation
- G06V40/174: Facial expression recognition
Abstract
The invention relates to a voice interaction apparatus and method. In one embodiment, the voice interaction method may comprise the following steps: receiving a first voice input from a human user and a first expression image input associated with the first voice input; identifying a first meaning of the first voice input; identifying a first expression from the first expression image input; determining, based on the first meaning and the first expression, a first confidence level associated with the first meaning; and generating first response information based on the first meaning and the first confidence level. By using the expression together with the meaning to generate response information, the method can improve the experience of the human user in the human-machine voice interaction process.
Description
Technical field
The present invention relates generally to the field of human-computer interaction, and more specifically to a voice interaction apparatus and method that can improve the accuracy of speech recognition and generate more appropriate voice responses, thereby realizing a more intelligent and more human-like human-machine interaction process.
Background art
Language is the most convenient and most effective means of communication between people, so it is natural to apply voice communication to the field of human-computer interaction, replacing traditional human-machine interaction modes such as the keyboard and mouse. Natural-language dialogue between human and machine means that the machine can "understand" human speech; this is what speech recognition technology is for.

Language is an art that has evolved over thousands of years; it carries rich information far beyond the literal text, and the humans who use it are intelligent beings with a wide range of emotions. Communication that is simple and fast between people may therefore be highly complex for a machine. Although many techniques have been proposed to improve the accuracy of speech recognition, these existing techniques are essentially pattern matching processes: the pattern of the received speech is compared, one by one, against reference patterns of known speech to determine the recognition result. These techniques make little use of the information contained in, and associated with, the speech, so speech recognition sometimes fails to identify the true intention of the human user. For example, irony, words said in a fit of anger, and an uncertain tone all occur in speech between people, yet they are beyond the recognition capability of existing speech recognition technology. Existing speech recognition technology can only conduct the voice interaction process in a "mechanical" way, which hinders machines from developing in a more intelligent and more human-like direction.

Accordingly, there is a need for an improved human-machine voice interaction apparatus and method that enables a machine to understand the true intention of a human user more accurately, thereby improving the machine's degree of intelligence and level of human-likeness, better simulating the communication process between people, and improving the interactive experience of the human user.
Summary of the invention
One aspect of the present invention is to use more information in human-machine voice interaction so that a machine can understand the true intention of a human user more accurately.

An exemplary embodiment of the present invention provides a voice interaction method, which may include: receiving a first voice input from a human user and a first expression image input associated with the first voice input; identifying a first semantic meaning of the first voice input; identifying a first expression from the first expression image input; determining, based on the first semantic meaning and the first expression, a first confidence level associated with the first semantic meaning; and generating first response information based on the first semantic meaning and the first confidence level.

In one example, determining the first confidence level associated with the first semantic meaning may include: assigning a default confidence level to the first semantic meaning; and adjusting the default confidence level based on the first expression.

In one example, determining the first confidence level associated with the first semantic meaning may further include: adjusting the default confidence level based on the context of the voice interaction.

In one example, adjusting the default confidence level based on the first expression may include: increasing the default confidence level when the first expression is a positive expression; decreasing the default confidence level when the first expression is a negative expression; and keeping the default confidence level unchanged when the first expression is a neutral expression other than a positive expression or a negative expression.
In one example, the positive expressions may include happy, pleasantly surprised, anxious, and serious, and the negative expressions may include angry, disgusted, disdainful, fearful, sad, hesitant, astonished, and suspicious.

In one example, determining the first confidence level associated with the first semantic meaning may further include: judging whether the first semantic meaning contains an emotion keyword; if the first semantic meaning contains no emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if the first semantic meaning contains an emotion keyword, judging whether the emotion keyword matches the first expression; if the emotion keyword matches the first expression, increasing the default confidence level; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence level based on the first expression.

In one example, determining the first confidence level associated with the first semantic meaning may further include: judging the semantic type of the first semantic meaning; if the semantic type of the first semantic meaning is a question, increasing the default confidence level; and if the semantic type of the first semantic meaning is a statement or a request, performing the step of adjusting the default confidence level based on the first expression.

In one example, determining the first confidence level associated with the first semantic meaning may further include: judging the semantic type of the first semantic meaning; if the semantic type is a question, increasing the default confidence level; if the semantic type is a statement or a request, judging whether the first semantic meaning contains an emotion keyword; if it contains no emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if it contains an emotion keyword, judging whether the emotion keyword matches the first expression; if the emotion keyword matches the first expression, increasing the default confidence level; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence level based on the first expression.
In one example, generating the first response information based on the first semantic meaning and the first confidence level may include: when the first confidence level is above a predetermined threshold, generating first response information that includes content directly associated with the first semantic meaning; and when the first confidence level is below the predetermined threshold, generating first response information that asks the human user to confirm the first semantic meaning.

In one example, the first response information generated when the first confidence level is below the predetermined threshold may further include content indirectly associated with the first semantic meaning.

In one example, generating the first response information based on the first semantic meaning and the first confidence level may include: when the first confidence level is above a predetermined threshold, generating first response information that includes content directly associated with the first semantic meaning; when the first confidence level is below the predetermined threshold, comparing the first confidence level with a second confidence level, the second confidence level being the confidence level associated with the human user's voice input immediately preceding the first voice input; if the first confidence level is greater than the second confidence level, generating first response information that asks the human user to confirm the first semantic meaning; and if the first confidence level is less than the second confidence level, generating first response information that asks the human user to confirm the first semantic meaning and also includes content indirectly associated with the first semantic meaning.

In one example, the method may further include synthesizing the first response information into speech with a tone corresponding to the first expression and playing it to the human user.
Another exemplary embodiment of the present invention provides a voice interaction apparatus, which may include: a speech recognition module configured to identify a first semantic meaning of a first voice input from a human user; an image recognition module configured to identify a first expression from a first expression image input of the human user associated with the first voice input; a confidence module configured to determine, based on the first semantic meaning and the first expression, a first confidence level associated with the first semantic meaning; and a response generation module configured to generate first response information based on the first semantic meaning and the first confidence level.

In one example, the confidence module may be configured to determine the first confidence level associated with the first semantic meaning by: assigning a default confidence level to the first semantic meaning; and adjusting the default confidence level based on the first expression.

In one example, the confidence module may be further configured to determine the first confidence level associated with the first semantic meaning by: adjusting the default confidence level based on the context of the voice interaction.

In one example, the confidence module may be configured to adjust the default confidence level based on the first expression by: increasing the default confidence level when the first expression is a positive expression; decreasing the default confidence level when the first expression is a negative expression; and keeping the default confidence level unchanged when the first expression is a neutral expression other than a positive expression or a negative expression.

In one example, the positive expressions may include happy, pleasantly surprised, anxious, and serious, and the negative expressions may include angry, disgusted, disdainful, fearful, sad, hesitant, astonished, and suspicious.
In one example, the confidence module may be further configured to determine the first confidence level associated with the first semantic meaning by: judging whether the first semantic meaning contains an emotion keyword; if it contains no emotion keyword, performing the step of adjusting the default confidence level based on the first expression; if it contains an emotion keyword, judging whether the emotion keyword matches the first expression; if the emotion keyword matches the first expression, increasing the default confidence level; and if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence level based on the first expression.

In one example, the confidence module may be further configured to determine the first confidence level associated with the first semantic meaning by: judging the semantic type of the first semantic meaning; if the semantic type is a question, increasing the default confidence level; and if the semantic type is a statement or a request, performing the step of adjusting the default confidence level based on the first expression.

In one example, the response generation module may be configured to generate the first response information by: when the first confidence level is above a predetermined threshold, generating first response information that includes content directly associated with the first semantic meaning; and when the first confidence level is below the predetermined threshold, generating first response information that asks the human user to confirm the first semantic meaning.
In one example, when the first confidence level is below the predetermined threshold, the first response information generated by the response generation module may further include content indirectly associated with the first semantic meaning.

In one example, the response generation module may be configured to generate the first response information by: when the first confidence level is above a predetermined threshold, generating first response information that includes content directly associated with the first semantic meaning; when the first confidence level is below the predetermined threshold, comparing the first confidence level with a second confidence level, the second confidence level being the confidence level associated with the human user's voice input immediately preceding the first voice input; if the first confidence level is greater than the second confidence level, generating first response information that asks the human user to confirm the first semantic meaning; and if the first confidence level is less than the second confidence level, generating first response information that asks the human user to confirm the first semantic meaning and also includes content indirectly associated with the first semantic meaning.

In one example, the apparatus may further include a speech synthesis module configured to synthesize the first response information into speech with a tone corresponding to the first expression and play it to the human user.
Another exemplary embodiment of the present invention provides an electronic device, which may include: a voice receiving unit; an image receiving unit; a memory; and a processor interconnected with the voice receiving unit, the image receiving unit, and the memory through a bus system, the processor being configured to run instructions stored in the memory to perform any one of the methods described above.

Another exemplary embodiment of the present invention provides a computer program product, which may include computer program instructions that, when run by a processor, cause the processor to perform any one of the methods described above.

Another exemplary embodiment of the present invention provides a computer-readable storage medium on which computer program instructions may be stored, the instructions, when run by a processor, causing the processor to perform any one of the methods described above.
Brief description of the drawings
The above and other objects, features, and advantages of the present application will become more apparent from the following more detailed description of its embodiments in conjunction with the accompanying drawings. The drawings provide a further understanding of the embodiments of the application, constitute a part of the specification, and serve to explain the application together with its embodiments without limiting it. In the drawings, identical reference numbers generally denote identical components or steps.
Fig. 1 is a flow chart illustrating a voice interaction method according to an exemplary embodiment of the present invention.

Fig. 2 is a flow chart illustrating a process of determining a confidence level based on a semantic meaning and an expression according to an exemplary embodiment of the present invention.

Fig. 3 is a flow chart illustrating a process of determining a confidence level based on a semantic meaning and an expression according to another exemplary embodiment of the present invention.

Fig. 4 is a flow chart illustrating a process of determining a confidence level based on a semantic meaning and an expression according to yet another exemplary embodiment of the present invention.

Fig. 5 is a flow chart illustrating a process of generating response information based on a semantic meaning and a confidence level according to an exemplary embodiment of the present invention.

Fig. 6 is a block diagram illustrating a voice interaction apparatus according to an exemplary embodiment of the present invention.

Fig. 7 is a block diagram illustrating an electronic device according to an exemplary embodiment of the present invention.
Detailed description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the application rather than all of them, and it should be understood that the application is not limited by the example embodiments described here.
Fig. 1 illustrates the general flow of a human-machine voice interaction method 100 according to an exemplary embodiment of the present invention. Here, "human" may denote a human user, and "machine" may denote any type of electronic device with human-computer interaction functionality, including but not limited to mobile electronic devices such as smartphones, tablets, notebooks, robots, personal digital assistants, and in-vehicle electronic devices, and non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, smart home appliances, and intelligent customer service equipment. All of these devices may use the voice interaction apparatus and method described here. It should also be understood that the voice interaction apparatus and method described here are also applicable to electronic devices with voice interaction functionality developed in the future.
Referring to Fig. 1, the voice interaction method 100 may begin at steps S110 and S112. In step S110, the electronic device performing the voice interaction may receive a first voice input from a human user, and in step S112 it may receive a first expression image input from the human user associated with that first voice input. For example, the electronic device may capture the speech uttered by the human user using a microphone or a microphone array while capturing the user's facial expression image with a camera. In most cases a human user is directly in front of the electronic device during human-machine interaction, so the device may by default treat a captured face located directly ahead as the expression of the user engaged in the voice interaction. In other embodiments, the electronic device may detect and track the human user engaged in the voice interaction. For example, the device may use a microphone array and sound source localization to detect the direction of the interacting user and then rotate its camera to point in that direction, thereby obtaining the user's expression image. Sound source localization is well known to those skilled in the art, and its basic principles are not detailed here. Technical solutions for detecting and tracking a user by sound source localization are also described in the applicant's Chinese invention patent applications 201610341566.X and 201610596000.1, the disclosures of which are incorporated herein by reference.
It is understood that the audio signal captured by the microphone or microphone array and the video or image signal captured by the camera may be preprocessed and given timestamps. The electronic device can then associate a voice input (audio signal) with an expression image input (video or image signal) based on time: for example, when the electronic device detects a voice input, it can extract the expression image input whose timestamp is the same as, or close to, that of the voice input.
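As a concrete illustration, the following is a minimal sketch of such timestamp-based association, assuming a simple frame structure and a 0.5-second tolerance window; both are illustrative choices, not specifics from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    timestamp: float  # capture time in seconds
    image: bytes      # encoded expression image data

def match_expression_frame(voice_start: float,
                           frames: List[Frame],
                           tolerance: float = 0.5) -> Optional[Frame]:
    """Return the frame whose timestamp is closest to the start time of the
    detected voice input, or None if no frame lies within the tolerance."""
    if not frames:
        return None
    best = min(frames, key=lambda f: abs(f.timestamp - voice_start))
    return best if abs(best.timestamp - voice_start) <= tolerance else None
```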
Next, in step S114, speech recognition may be performed on the received first voice input to determine its first semantic meaning. Here, the first semantic meaning may be the literal meaning of the first voice input, i.e. its textual representation, which existing speech recognition technologies can already identify with a very high accuracy rate. For example, when the human user says "book a flight to Shanghai for tomorrow", speech recognition can identify the text "book a flight to Shanghai for tomorrow" as the first semantic meaning.

In addition, in step S116, image recognition may be performed on the received first expression image input to determine the first expression of the human user. For example, the user's first expression may be identified as happy, anxious, hesitant, and so on, or the first expression may be neutral, i.e. an expressionless face.
It should be understood that in steps S114 and S116 the present invention may use any existing speech recognition and image recognition technology. For example, usable speech recognition techniques include methods based on vocal tract models and phonetic knowledge, pattern matching methods, and artificial neural network methods. Pattern matching has been studied in ever greater depth and includes, for example, dynamic time warping (DTW), hidden Markov models (HMM), and vector quantization (VQ). Artificial neural network methods are a popular line of research in recent years and are typically used in combination with existing pattern matching methods. Usable image recognition techniques may be those specialized for facial expression recognition, which can generally be divided into three types: holistic recognition and local recognition methods; deformation extraction and motion extraction methods; and geometric feature and appearance feature methods. Taking the common holistic and local recognition methods as an example, holistic methods may include eigenface-based principal component analysis (PCA), independent component analysis (ICA), Fisher's linear discriminants, local feature analysis (LFA), Fisher actions, hidden Markov models (HMM), and cluster analysis, while local methods may include facial action coding system (FACS) analysis, facial motion parameter methods, local principal component analysis (local PCA), Gabor wavelet methods, and neural network methods. It should also be understood that the invention is not limited to the examples given here and may also use other speech recognition and facial expression recognition technologies, including those developed in the future.
Next, in step S118, a first confidence level associated with the first semantic meaning can be determined based on the identified first semantic meaning and first expression. In the present invention, the confidence level can be defined as a quantity indicating whether the first semantic meaning is the true intention of the human user. For example, it can be a numerical range: the larger the value, the more certain it is that the first semantic meaning is the user's true intention; the lower the value, the less certain it is that the first semantic meaning is what the user intends, for example because the user is not fully satisfied with, or is hesitant about, the meaning expressed by this speech.
The goal of conventional speech recognition is only accuracy, striving to identify exactly the words spoken by the human user; that recognition process is "mechanical", and the resulting interaction process is therefore mechanical as well, quite different from communication between people. When people communicate, they not only recognize the surface meaning of the spoken words but also read each other's faces, judging the other party's mood or attitude from their expression in order to judge whether the other party's words represent their true intention. A basic principle of the present invention is to judge, in human-machine interaction, whether the speech recognition result is the true intention of the human user by identifying the user's expression, thereby realizing an interaction process more like communication between people.
Specifically, in step S118, a default confidence level may first be assigned to the first semantic meaning. For example, the confidence level may take values from 1 to 10, with 10 representing the high-confidence end and 1 the low-confidence end, and the default confidence level may be set in the middle of this range, e.g. 4-6. In one example, the default confidence level may be set to, for example, 5.
The assigned default confidence level can then be adjusted according to the identified first expression. Expressions can be broadly divided into three classes: positive expressions, negative expressions, and neutral expressions. A positive expression indicates that the confidence level of the user's words is high, i.e. that they represent the user's true intention. For example, when the user shows a happy or pleasantly surprised expression, the confidence level can be considered high; when the user shows an anxious or serious expression, the confidence level of their words can likewise be considered high. Therefore, when the identified first expression is one of these expressions, the default confidence level can be increased. On the other hand, when the user shows a negative expression such as anger, disgust, disdain, fear, sadness, hesitation, astonishment, or suspicion, the confidence level of their words can be considered low, so the assigned default confidence level is decreased. For example, when the user says "book a flight to Shanghai for tomorrow" with a happy or serious expression, the user is probably quite sure of this intention, so "book a flight to Shanghai for tomorrow" is exactly the user's true intention; but when the user says "book a flight to Shanghai for tomorrow" with a hesitant, sad, dejected, or angry expression, the user is likely not yet sure about flying to Shanghai tomorrow, or is unhappy about that schedule, so "book a flight to Shanghai for tomorrow" may not represent what the user truly wants, and the assigned default confidence value should then be reduced. When the user's expression is neutral, e.g. no particular expression, the assigned default confidence value can be kept unchanged.
It should be understood that the principles of the present invention are not limited to the specific example expressions given here; more expressions may be used, and different classification rules may even be used to classify a particular expression as positive, negative, or neutral.
In some embodiments, each positive and negative expression can be further divided into degrees or levels. For happiness, for example, a smile may represent a lower degree of happiness, a grin a moderate degree, and open-mouthed laughter a higher degree. The adjustment to the default confidence value can also differ according to the degree or level of each expression: for example, a lower-degree positive expression may raise the confidence value by 1, a moderate positive expression by 2, and a higher-degree positive expression by 3. Neutral expressions, of course, need not be divided into degrees or levels.
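The following is a minimal sketch of this expression-based adjustment on the 1-10 scale with a default of 5, as described above; the expression labels, step sizes, and clamping are illustrative assumptions.

```python
# Expression classes as given in the examples above; real systems may use
# different labels and classification rules.
POSITIVE = {"happy", "pleasantly_surprised", "anxious", "serious"}
NEGATIVE = {"angry", "disgusted", "disdainful", "fearful",
            "sad", "hesitant", "astonished", "suspicious"}

DEFAULT_CONFIDENCE = 5
MIN_CONF, MAX_CONF = 1, 10

def adjust_for_expression(confidence: int, expression: str, degree: int = 1) -> int:
    """Raise confidence for a positive expression, lower it for a negative
    one, and leave it unchanged for a neutral expression. `degree` (1-3)
    scales the step for graded expressions such as smile/grin/laugh."""
    if expression in POSITIVE:
        confidence += degree
    elif expression in NEGATIVE:
        confidence -= degree
    # neutral expression: unchanged
    return max(MIN_CONF, min(MAX_CONF, confidence))
```

For instance, adjust_for_expression(DEFAULT_CONFIDENCE, "happy", degree=2) yields 7, while "sad" with the same degree yields 3.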
In some embodiments, the assigned default confidence level may also be adjusted based on the context of the voice interaction. For example, when earlier interaction content indicates that heavy rain is forecast in Shanghai tomorrow, the confidence level of the user's utterance "book a flight to Shanghai for tomorrow" is relatively low; likewise, if the earlier interaction or the user's calendar shows that the user has a meeting scheduled elsewhere, e.g. in Beijing, tomorrow, the confidence level of that utterance is relatively low. The assigned default confidence value can therefore be adjusted based on context, realizing a more intelligent confidence determination process.
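A minimal sketch of such a context check follows; how context facts are gathered and what counts as a conflict are illustrative assumptions (the fact strings here simply echo the examples above), and MIN_CONF comes from the earlier sketch.

```python
from typing import List

# Each pair names an intent fragment and a context fact that contradicts it.
CONTEXT_CONFLICTS = [
    ("flight to Shanghai", "heavy rain in Shanghai tomorrow"),
    ("flight to Shanghai", "meeting in Beijing tomorrow"),
]

def adjust_for_context(confidence: int, semantics: str,
                       context_facts: List[str]) -> int:
    for intent, conflicting_fact in CONTEXT_CONFLICTS:
        if intent in semantics and conflicting_fact in context_facts:
            confidence -= 1  # the context contradicts the stated intention
    return max(MIN_CONF, confidence)
```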
After the first confidence level of the first semantic meaning is determined in step S118, first response information can be generated in step S120 according to the first semantic meaning and the associated first confidence level. In a conventional interaction process, response information is generated from the first semantic meaning alone. In the present invention, because the first confidence level associated with the first semantic meaning is known, i.e. the degree to which the first semantic meaning represents the user's true intention, different response information can be generated based on this knowledge. In some embodiments, when the determined first confidence level is high, e.g. above a predetermined threshold, the response information is generated under a first standard: as in regular voice interaction, information directly associated with the first semantic meaning is generated. It is understood that "directly associated" means information that the user, as determined from the first semantic meaning, probably directly wants. For example, if the first semantic meaning is "book a flight to Shanghai for tomorrow" and the first confidence level is above the predetermined threshold, the electronic device queries ticket information: if no tickets remain, it generates a response such as "there are no tickets left for Shanghai tomorrow"; if tickets remain, it generates a response such as "airline A and airline B still have tickets; please choose an airline". On the other hand, when the determined first confidence level is low, e.g. below the predetermined threshold, the first semantic meaning is most likely not a declaration of what the user truly wants or is satisfied with, and the electronic device can generate the response information under a second standard different from the first, for example a response asking the user to confirm the first semantic meaning. For example, if the first semantic meaning is "book a flight to Shanghai for tomorrow" and the first confidence level is below the predetermined threshold, the electronic device can generate a response such as "are you sure you want to book a flight to Shanghai for tomorrow?", giving the user a chance to think it over. In addition, when the determined first confidence level is low, the electronic device may also generate content indirectly associated with the first semantic meaning. It is understood that "indirectly associated" means information that may not be what the user directly wants, but that is related to the user or to the information the user directly wants. For example, when the first semantic meaning is "book a flight to Shanghai for tomorrow" and the first confidence level is below the predetermined threshold, the electronic device can generate a response such as "are you sure you want to book a flight to Shanghai for tomorrow? Heavy rain is forecast in Shanghai tomorrow" or "are you sure you want to book a flight to Shanghai for tomorrow? You have a meeting scheduled in Beijing tomorrow", so that the user can weigh the relevant factors and decide whether the first semantic meaning is their true intention.
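A minimal sketch of this threshold-gated response generation follows; the threshold value and the helper strings are illustrative assumptions, and the ticket lookup itself is elided.

```python
CONF_THRESHOLD = 6  # predetermined threshold on the 1-10 scale (assumed)

def generate_response(semantics: str, confidence: int,
                      direct_content: str,
                      indirect_content: str = "") -> str:
    if confidence > CONF_THRESHOLD:
        # High confidence: answer the request directly (first standard).
        return direct_content
    # Low confidence (second standard): ask the user to confirm, optionally
    # adding indirectly associated content for the user to consider.
    response = f"Are you sure you want to {semantics}?"
    if indirect_content:
        response += " " + indirect_content
    return response
```

For example, generate_response("book a flight to Shanghai for tomorrow", 4, "Airline A and airline B still have tickets.", "Heavy rain is forecast in Shanghai tomorrow.") produces the confirmation-plus-weather response described above.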
Then, in step S122, the generated first response information can be synthesized into speech by text-to-speech (TTS) technology and played to the human user through a speaker and/or a display, completing one round of the voice interaction process. Here too the present invention may use any existing or future speech synthesis technology, which is not described further.
In some embodiments, the first response information can be synthesized into speech with a tone corresponding to the first expression. For example, when the user's first expression is happy or excited, step S122 may synthesize the speech with a cheerful tone; when the user's expression is sad, dejected, or fearful, step S122 may synthesize the speech with a comforting tone; and when the user's expression is angry, disgusted, or disdainful, step S122 may synthesize the speech with a careful, soothing tone. In this way the voice response played to the user is easier for the user to accept, helps improve the user's mood, and improves the user's interactive experience. Of course, the correspondence between the synthesized tone and the expression is not limited to the examples given here and can be defined differently according to the application scenario.
In conventional emotional speech synthesis, the semantics of the text generally has to be analyzed for the machine to determine the emotion or tone required for the synthesized speech. In the present invention, the identified first expression can be used directly, and the corresponding tone or emotion adopted, to synthesize the speech, so the text analysis used to determine the tone can be omitted. The procedure is simpler, and the tone of the synthesized speech can more accurately match the user's current mood or emotion, making the interaction process warmer and more human and avoiding a cold, mechanical feel.
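A minimal sketch of this direct expression-to-tone selection follows; the mapping is an illustrative assumption, and synthesize() is a placeholder standing in for whatever TTS engine is used, not a real API.

```python
EXPRESSION_TONE = {
    "happy": "cheerful", "excited": "cheerful",
    "sad": "comforting", "dejected": "comforting", "fearful": "comforting",
    "angry": "soothing", "disgusted": "soothing", "disdainful": "soothing",
}

def synthesize(text: str, tone: str) -> None:
    print(f"[TTS, tone={tone}] {text}")  # placeholder for a real TTS call

def speak_response(text: str, first_expression: str) -> None:
    # Choose the tone from the recognized expression, not from text analysis.
    tone = EXPRESSION_TONE.get(first_expression, "neutral")
    synthesize(text, tone)
```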
Some exemplary embodiments of the present invention have been described above with reference to Fig. 1, and they apply to many common voice communication scenarios. However, voice communication between people is complex and may encounter a variety of special situations. Human-machine voice interaction methods that can handle some such special scenarios are described below with reference to the drawings.
Fig. 2 illustrates a flow chart of a process 200 of determining the first confidence level based on the first semantic meaning and the first expression according to another exemplary embodiment of the present invention. In step S118 described above with reference to Fig. 1, the first confidence level is determined by adjusting the assigned default confidence level based on the first expression: when the first expression is a positive expression, the default confidence level is increased; when it is a negative expression, the default confidence level is decreased; and when it is a neutral expression, the default confidence level is kept unchanged. However, given the complexity of verbal communication, this adjustment may be flawed in some situations. For example, when a human user recounts something sad with a very sad expression, or recounts something terrifying with a very terrified expression, the confidence level of the words should generally be judged to be high and should not be reduced. Therefore, in the embodiment shown in Fig. 2, step S210 first searches the first semantic meaning for emotion keywords. An emotion keyword is a word that can be associated with a specific expression or emotion; for example, "disaster" and "accident" are associated with sadness and fear, "travel" and "shopping" are associated with happiness, and so on. If no emotion keyword is retrieved in step S210, the previously described step of adjusting the assigned default confidence level based on the first expression is performed in step S212. If an emotion keyword is retrieved in step S210, step S214 judges whether the retrieved emotion keyword matches the first expression. In some embodiments, step S210 may retrieve multiple emotion keywords, in which case step S214 can compare each keyword with the first expression: as long as one emotion keyword matches the first expression, the result is a match; only when all emotion keywords fail to match the first expression is the result a mismatch.
If the result in step S214 is a mismatch, the previously described step of adjusting the assigned default confidence level based on the first expression can be performed in step S216. If the result in step S214 is a match, the expression of the human user is consistent with the content of their speech, and the confidence level of the first semantic meaning can be considered very high; the assigned default confidence level can then be increased directly in step S218, and the increased confidence level can be output as the first confidence level associated with the first semantic meaning for the operation described in step S120.
The above describes judging, from the content of the first semantic meaning, whether it matches the first expression. In other situations, the type of the first semantic meaning can also be taken into account in the voice interaction. Fig. 3 shows a flow chart of a process 300 of determining the first confidence level based on the first semantic meaning and the first expression according to another embodiment of the present invention. As shown in Fig. 3, step S310 first judges the semantic type of the first semantic meaning. Linguistically, semantic types are generally divided into three kinds: statement, question, and request, i.e. declarative sentences, interrogative sentences, and imperative sentences, and different semantic types generally correspond to different confidence levels. For example, when a user asks a question, it generally shows that the user wants to know the answer, so the confidence level is typically high; when a user utters a declarative or imperative sentence, the confidence level is generally difficult to judge from the semantic type alone.

Therefore, if step S310 judges the semantic type of the first semantic meaning to be a question, the assigned default confidence level can be increased directly in step S312, and the increased confidence level can be output as the first confidence level associated with the first semantic meaning for the operation described in step S120. On the other hand, if step S310 judges the semantic type of the first semantic meaning to be a statement or a request, or indeed any semantic type other than a question, the previously described step of adjusting the assigned default confidence level based on the first expression can be performed in step S314.
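A minimal sketch of this branch follows, again reusing the earlier constants; how the utterance is classified as question, statement, or request is assumed to be done elsewhere, e.g. by a parser.

```python
def confidence_fig3(semantic_type: str, expression: str, degree: int = 1) -> int:
    if semantic_type == "question":  # S312: questions are trusted directly
        return min(MAX_CONF, DEFAULT_CONFIDENCE + 1)
    # statements, requests, etc. (S314): fall back to expression adjustment
    return adjust_for_expression(DEFAULT_CONFIDENCE, expression, degree)
```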
Fig. 4 shows a process 400 that considers both factors described above, the emotion keyword and the semantic type. Referring to Fig. 4, step S410 first judges the semantic type of the first semantic meaning. If the semantic type of the first semantic meaning is a question, the assigned default confidence level is increased in step S412, and the increased confidence level can be output as the first confidence level associated with the first semantic meaning for the operation described in step S120. If the semantic type of the first semantic meaning is a statement or a request, or any semantic type other than a question, the process can proceed to step S414.

In step S414, it is judged whether the first semantic meaning contains an emotion keyword. If the first semantic meaning contains no emotion keyword, the previously described step of adjusting the default confidence level based on the first expression is performed in step S416. If the first semantic meaning contains an emotion keyword, step S418 judges whether the emotion keyword matches the first expression. If they match, the assigned default confidence level is increased directly in step S420, and the increased confidence level can be output as the first confidence level associated with the first semantic meaning for the operation described in step S120; if they do not match, the previously described step of adjusting the default confidence level based on the first expression is performed in step S422.
Fig. 5 illustrates a flow chart of another embodiment 500 of generating the first response information based on the identified first semantic meaning and the determined first confidence level. First, in step S510, it can be determined whether the first confidence value is above a predetermined threshold. As described above, the predetermined threshold can be a predetermined confidence standard: when the first confidence value is above the predetermined threshold, the confidence level can be considered high; when it is below the predetermined threshold, the confidence level can be considered low.

When the first confidence level is above the predetermined threshold, first response information including content directly associated with the first semantic meaning can be generated in step S512. When the first confidence level is below the predetermined threshold, step S514 can compare the first confidence level with the confidence value of the immediately preceding voice input (for convenience called the second confidence level here). The comparison between the first confidence level and the earlier second confidence level can reflect the change in the human user's mood during the voice interaction. For example, if the first confidence level is greater than the second confidence level, then although the absolute confidence is still low (the first confidence level is below the threshold), the relative confidence is rising (the first confidence level is greater than the second confidence level), so the interaction may be developing in a good direction; in that case, first response information asking the human user to confirm the first semantic meaning can be generated in step S516. On the other hand, if step S514 determines that the first confidence level is less than the earlier second confidence level, then not only is the absolute confidence low, the relative confidence is also falling, and the interaction may be developing in a bad direction; in that case, the first response information generated in step S518 can include not only content asking the human user to confirm the first semantic meaning but also content indirectly associated with the first semantic meaning, for the user to consider and choose from.
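A minimal sketch of this trend-aware policy follows, reusing CONF_THRESHOLD from the earlier sketch; prev_confidence plays the role of the second confidence level.

```python
def generate_response_fig5(semantics: str, confidence: int,
                           prev_confidence: int,
                           direct_content: str,
                           indirect_content: str) -> str:
    if confidence > CONF_THRESHOLD:           # S510 -> S512: high confidence
        return direct_content
    if confidence > prev_confidence:          # S514 -> S516: low but rising
        return f"Are you sure you want to {semantics}?"
    # S518: low and falling; add indirect content for the user to weigh
    return f"Are you sure you want to {semantics}? {indirect_content}"
```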
Next, a voice interaction apparatus according to an exemplary embodiment of the present invention is described with reference to Fig. 6. As noted above, the voice interaction apparatus of the present invention may be applied to any type of electronic device with human-computer interaction functionality, including but not limited to mobile electronic devices such as smartphones, tablets, notebooks, robots, personal digital assistants, and in-vehicle electronic devices, and non-mobile electronic devices such as desktop computers, information service terminals, ticketing terminals, smart home appliances, and intelligent customer service equipment. All of these devices may use the voice interaction apparatus and method described here. It should also be understood that the voice interaction apparatus described here is also applicable to electronic devices with voice interaction functionality developed in the future.
As shown in Fig. 6, the voice interaction apparatus 600 may include a speech recognition module 610, an image recognition module 620, a confidence module 630, a response generation module 640, and a speech synthesis module 650. The speech recognition module 610 may be configured to identify the first semantic meaning of the first voice input from the human user. It is understood that the speech recognition module 610 may use any existing speech recognition engine, e.g. a commercially available one, or a speech recognition engine developed in the future. The image recognition module 620 may be configured to identify the first expression from the first expression image input of the human user associated with the first voice input. It is likewise understood that the image recognition module 620 may use any existing, e.g. commercially available, expression image recognition engine, or an expression image recognition engine developed in the future. The confidence module 630 can determine the first confidence level associated with the first semantic meaning based on the first semantic meaning identified by the speech recognition module 610 and the first expression identified by the image recognition module 620. For example, the confidence module 630 may first assign a default confidence level to the first semantic meaning and then adjust the assigned default confidence level based on the first expression to obtain the final first confidence level. Specifically, when the first expression is a positive expression, the default confidence level is increased; when the first expression is a negative expression, the default confidence level is decreased; and when the first expression is another expression, such as a neutral expression, the assigned default confidence level is kept unchanged.
In some embodiments, the confidence module 630 can also judge whether the first semantic meaning contains an emotion keyword and compare the contained emotion keyword with the first expression. If an emotion keyword contained in the first semantic meaning matches the first expression, the confidence level of the user's words is high, so the assigned default confidence level is increased directly. If the first semantic meaning contains no emotion keyword, or the contained emotion keyword does not match the first expression, the previously described operation of adjusting the assigned default confidence level based on the first expression can be performed.
In some embodiments, the confidence module 630 may also determine the semantic type of the first semantic meaning. If the semantic type of the first semantic meaning is a question, the user's words are considered highly credible, so the assigned default confidence level is increased directly. If it is another semantic type, for example a statement or a request, the previously described operation of adjusting the assigned default confidence level based on the first expression may be performed.
In some embodiments, the confidence module 630 may also adjust the assigned default confidence level based on context. For example, if the first semantic meaning is consistent with the context of the voice interaction, its confidence is high, so the assigned default confidence level is increased; conversely, if it is inconsistent, the assigned default confidence level is decreased.
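Taken together, the three refinements just described (emotion keyword, semantic type, context) can be layered in front of the expression-based adjustment from the earlier sketch. The fragment below is a hypothetical combination; the keyword lexicon, the question test and the context test are crude stand-ins for whatever classifiers an implementation actually uses:

```python
EMOTION_KEYWORDS = {  # hypothetical keyword-to-polarity lexicon
    "glad": "positive", "love": "positive",
    "sad": "negative", "hate": "negative",
}
EXPRESSION_POLARITY = {  # hypothetical expression-to-polarity mapping
    "happy": "positive", "joyful": "positive", "excited": "positive",
    "sad": "negative", "angry": "negative", "fearful": "negative",
}

def find_emotion_keyword(semantic: str) -> str | None:
    for word in semantic.lower().split():
        if word in EMOTION_KEYWORDS:
            return word
    return None

def is_question(semantic: str) -> bool:
    # Stand-in check; a real system would use the parsed semantic type.
    return semantic.rstrip().endswith("?")

def consistent_with_context(semantic: str, context: list[str]) -> bool:
    # Crude stand-in: any word overlap with earlier turns counts as consistent.
    words = set(semantic.lower().split())
    return any(words & set(turn.lower().split()) for turn in context)

def determine_confidence(semantic: str, expression: str, context: list[str],
                         default_confidence: float = 0.5) -> float:
    """Determine the first confidence level from semantics, expression and context."""
    confidence = default_confidence
    keyword = find_emotion_keyword(semantic)
    if is_question(semantic):
        # A question is taken at face value: raise confidence directly.
        confidence = min(1.0, confidence + 0.2)
    elif keyword and EMOTION_KEYWORDS[keyword] == EXPRESSION_POLARITY.get(expression):
        # Emotion keyword matches the facial expression: raise confidence directly.
        confidence = min(1.0, confidence + 0.2)
    else:
        # Otherwise fall back to the expression-based adjustment sketched above.
        confidence = adjust_by_expression(confidence, expression)
    # Context check: consistency raises confidence, inconsistency lowers it.
    if consistent_with_context(semantic, context):
        confidence = min(1.0, confidence + 0.1)
    else:
        confidence = max(0.0, confidence - 0.1)
    return confidence
```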
With continued reference to Fig. 6, the response generation module 640 of the voice interaction apparatus 600 may generate a first response message using the first semantic meaning from the speech recognition module 610 and the first confidence level from the confidence module 630. The response generation module 640 may generate the first response message according to different criteria depending on the first confidence level. In some embodiments, when the first confidence level is greater than a predetermined threshold, the first response message is generated based on a first criterion, for example a first response message including content directly associated with the first semantic meaning; when the first confidence level is less than the predetermined threshold, the first response message is generated based on a second criterion, for example a first response message asking the human user to confirm the first semantic meaning, or one that additionally includes content indirectly associated with the first semantic meaning.
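A minimal sketch of this two-criterion scheme follows. The threshold of 0.6 and the message wording are assumptions for illustration; the disclosure only speaks of a "predetermined threshold":

```python
def generate_response(semantic: str, confidence: float,
                      threshold: float = 0.6) -> str:
    """Generate the first response message according to the first confidence level."""
    if confidence > threshold:
        # First criterion: content directly associated with the first semantic meaning.
        return f"Here is what I found about {semantic}."
    # Second criterion: ask the human user to confirm what was understood.
    return f"Just to confirm, did you mean: {semantic}?"
```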
The process of generating the response message may involve using a knowledge base 660. The knowledge base 660 may be a local knowledge base included as part of the voice interaction apparatus 600, or, as shown in Fig. 6, a cloud knowledge base 660 to which the voice interaction apparatus 600 connects through a network such as a wide area network or a local area network. The knowledge base 660 may include various kinds of knowledge data, such as weather data, flight data, hotel data, movie data, music data, restaurant data, stock data, travel data, map data, government agency data, domain knowledge, historical knowledge, natural science knowledge, social science knowledge and so on. The response generation module 640 may obtain knowledge directly or indirectly related to the first semantic meaning from the knowledge base 660 for use in generating the first response message.
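A knowledge-base lookup of this kind might look as follows. The in-memory KnowledgeBase class and its two query methods are invented for the example; a real deployment could equally back them with cloud queries over a WAN or LAN, as noted above:

```python
class KnowledgeBase:
    """Toy in-memory knowledge base; a cloud variant would issue network queries."""

    def __init__(self, entries: dict[str, str]):
        self._entries = entries  # topic -> knowledge text

    def directly_related(self, semantic: str) -> str | None:
        # Direct relation: the semantic meaning names a known topic outright.
        for topic, text in self._entries.items():
            if topic in semantic.lower():
                return text
        return None

    def indirectly_related(self, semantic: str) -> list[str]:
        # Indirect relation (crudely): topics sharing at least one word
        # with the input without being named in it.
        words = set(semantic.lower().split())
        return [topic for topic in self._entries
                if topic not in semantic.lower() and words & set(topic.split())]

# Example: kb = KnowledgeBase({"weather": "Sunny, 25 degrees", "flights": "..."})
```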
In some embodiments, when the first confidence level is greater than the predetermined threshold, the response generation module 640 generates a first response message including content directly associated with the first semantic meaning. When the first confidence level is less than the predetermined threshold, the response generation module 640 further compares the first confidence level with a second confidence level, the second confidence level being the confidence level associated with the human user's voice input immediately preceding the first voice input. If the first confidence level is greater than the second confidence level, the response generation module 640 may generate a first response message asking the human user to confirm the first semantic meaning; if the first confidence level is less than the second confidence level, the response generation module 640 may generate a first response message asking the human user to confirm the first semantic meaning and additionally including content indirectly associated with the first semantic meaning.
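This relative comparison against the previous turn can be grafted onto the earlier generate_response sketch, reusing the KnowledgeBase stand-in from above. Threshold and wording remain illustrative assumptions:

```python
def generate_response_with_trend(semantic: str, confidence: float,
                                 previous_confidence: float, kb: KnowledgeBase,
                                 threshold: float = 0.6) -> str:
    """Generate the first response message, also weighing the confidence trend."""
    if confidence > threshold:
        return kb.directly_related(semantic) or f"Here is what I found about {semantic}."
    if confidence > previous_confidence:
        # Below threshold but rising: a plain confirmation request suffices.
        return f"Just to confirm, did you mean: {semantic}?"
    # Below threshold and falling: confirm, and also offer indirectly related
    # content so the user has alternatives to steer the dialogue with.
    suggestions = ", ".join(kb.indirectly_related(semantic)) or "something else"
    return (f"Just to confirm, did you mean: {semantic}? "
            f"Or were you perhaps asking about {suggestions}?")
```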
Then, the speech synthesis module 650 may synthesize the first response message generated by the response generation module 640 into speech, to be played to the human user through a speaker (not shown), thereby completing one round of the voice interaction process. In some embodiments, the speech synthesis module 650 may also use the first expression from the image recognition module 620 in performing speech synthesis. Specifically, the speech synthesis module 650 may synthesize the first response message into speech with a tone corresponding to the first expression. For example, when the user's first expression is a happy, joyful or excited expression, the speech synthesis module 650 may synthesize the speech in a cheerful tone; when the user's expression is sad, dejected or fearful, the speech synthesis module 650 may synthesize the speech in a comforting tone; and when the user's expression is angry, irritated, disgusted or disdainful, the speech synthesis module 650 may synthesize the speech in a gentle, placating tone. In this way, the voice response played to the user is more readily accepted, which helps improve the user's mood and the user's interactive experience. Of course, the speech synthesis module 650 may also perform speech synthesis according to other correspondences between expressions and tones, and is not limited to the examples given here.
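The expression-to-tone correspondence could be as simple as a lookup table handed to the synthesizer. The table below and the synthesize signature are illustrative assumptions; real TTS engines expose prosody controls in engine-specific ways:

```python
# Hypothetical mapping from the user's recognized expression to a response tone.
EXPRESSION_TO_TONE = {
    "happy": "cheerful", "joyful": "cheerful", "excited": "cheerful",
    "sad": "comforting", "dejected": "comforting", "fearful": "comforting",
    "angry": "gentle", "irritated": "gentle",
    "disgusted": "gentle", "disdainful": "gentle",
}

def synthesize(text: str, expression: str) -> str:
    """Pick a tone from the expression; a real TTS engine would take it as a prosody hint."""
    tone = EXPRESSION_TO_TONE.get(expression, "neutral")
    return f"[tone={tone}] {text}"  # placeholder for actual audio synthesis
```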
Fig. 7 shows a block diagram of an electronic device that may use the voice interaction apparatus and method described above, according to an exemplary embodiment of the present invention. As shown in Fig. 7, the electronic device 700 may include a voice receiving unit 710 and an image receiving unit 720. The voice receiving unit 710 may be, for example, a microphone or a microphone array, which can capture the user's voice. The image receiving unit 720 may be, for example, a monocular camera, a binocular camera or a multi-lens camera, which can capture images of the user, in particular facial images; the image receiving unit 720 may therefore have a face recognition function so as to accurately capture clear facial expression images of the user.
As shown in Fig. 7, the electronic device 700 may also include one or more processors 730 and a memory 740, which are connected to each other and to the voice receiving unit 710 and the image receiving unit 720 via a bus system 750. The processor 730 may be a central processing unit (CPU) or another form of processing unit, processing core or controller with data processing capability and/or instruction execution capability, and may control the other components of the electronic device 700 to perform desired functions. The memory 740 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disks, flash memory and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 730 may run the program instructions to implement the voice interaction methods of the embodiments of the present application described above and/or other desired functions. Various applications and various data, such as user data and knowledge databases, may also be stored in the computer-readable storage medium.
In addition, the electronic device 700 may also include an output unit 760. The output unit 760 may be, for example, a speaker for voice interaction with the user. In other embodiments, the output unit 760 may also be an output device such as a display or a printer.
In addition to the above methods, apparatuses and devices, embodiments of the present application may also be a computer program product comprising computer program instructions that, when run by a processor, cause the processor to execute the steps of the voice interaction method according to the various embodiments of the present application described in this specification.
The computer program product may include program code for carrying out the operations of the embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the electronic device, partly on the electronic device, as a stand-alone software package, partly on the user's electronic device and partly on a remote computing device, or entirely on a remote computing device or server.
In addition, embodiments of the present application may also be a computer-readable storage medium on which computer program instructions are stored; when run by a processor, the computer program instructions cause the processor to execute the steps of the voice interaction method according to the various embodiments of the present application described in this specification.
The computer-readable storage medium may adopt any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
The basic principles of the present application have been described above in connection with specific embodiments. However, it should be noted that the merits, advantages, effects and the like mentioned in the present application are merely examples and not limitations; it cannot be assumed that these merits, advantages and effects are required by every embodiment of the present application. In addition, the specific details disclosed above are provided only for the purpose of example and ease of understanding, not limitation; the above details do not restrict the application to being implemented only with those specific details.
The block diagrams of devices, apparatuses, equipment and systems referred to in the present application are merely illustrative examples and are not intended to require or imply that connection, arrangement or configuration must be carried out in the manner shown in the block diagrams. As those skilled in the art will recognize, these devices, apparatuses, equipment and systems may be connected, arranged or configured in any manner. Words such as "include", "comprise" and "have" are open-ended terms meaning "including but not limited to" and may be used interchangeably therewith. The words "or" and "and" as used herein refer to the word "and/or" and may be used interchangeably therewith, unless the context clearly indicates otherwise. The word "such as" as used herein refers to the phrase "such as, but not limited to" and may be used interchangeably therewith.
It should also be noted that, in the devices and methods of the present application, each component or each step may be decomposed and/or recombined. Such decompositions and/or recombinations should be regarded as equivalent solutions of the present application.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the present application. Therefore, the present application is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the present application to the forms disclosed herein. Although a number of exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (15)
1. A voice interaction method, comprising:
receiving a first voice input from a human user and a first facial expression image input associated with the first voice input;
recognizing a first semantic meaning of the first voice input;
recognizing a first expression of the first facial expression image input;
determining, based on the first semantic meaning and the first expression, a first confidence level associated with the first semantic meaning; and
generating a first response message based on the first semantic meaning and the first confidence level.
2. The method of claim 1, wherein determining the first confidence level associated with the first semantic meaning comprises:
assigning a default confidence level to the first semantic meaning; and
adjusting the default confidence level based on the first expression, comprising:
when the first expression is a positive expression, increasing the default confidence level;
when the first expression is a negative expression, decreasing the default confidence level; and
when the first expression is a neutral expression other than the positive expression and the negative expression, keeping the default confidence level unchanged.
3. The method of claim 1, wherein determining the first confidence level associated with the first semantic meaning further comprises:
determining whether the first semantic meaning contains an emotion keyword;
if the first semantic meaning does not contain an emotion keyword, performing the step of adjusting the default confidence level based on the first expression;
if the first semantic meaning contains an emotion keyword, determining whether the emotion keyword matches the first expression;
if the emotion keyword matches the first expression, increasing the default confidence level; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence level based on the first expression.
4. The method of claim 1, wherein determining the first confidence level associated with the first semantic meaning further comprises:
determining a semantic type of the first semantic meaning;
if the semantic type of the first semantic meaning is a question, increasing the default confidence level; and
if the semantic type of the first semantic meaning is a statement or a request, performing the step of adjusting the default confidence level based on the first expression.
5. The method of claim 1, wherein generating the first response message based on the first semantic meaning and the first confidence level comprises:
when the first confidence level is greater than a predetermined threshold, generating a first response message including content directly associated with the first semantic meaning; and
when the first confidence level is less than the predetermined threshold, generating a first response message asking the human user to confirm the first semantic meaning.
6. The method of claim 5, wherein the first response message generated when the first confidence level is less than the predetermined threshold further includes content indirectly associated with the first semantic meaning.
7. The method of claim 1, wherein generating the first response message based on the first semantic meaning and the first confidence level comprises:
when the first confidence level is greater than a predetermined threshold, generating a first response message including content directly associated with the first semantic meaning;
when the first confidence level is less than the predetermined threshold, comparing the first confidence level with a second confidence level, the second confidence level being a confidence level associated with a voice input of the human user immediately preceding the first voice input;
if the first confidence level is greater than the second confidence level, generating a first response message asking the human user to confirm the first semantic meaning; and
if the first confidence level is less than the second confidence level, generating a first response message asking the human user to confirm the first semantic meaning and including content indirectly associated with the first semantic meaning.
8. The method of claim 1, further comprising synthesizing the first response message into speech with a tone corresponding to the first expression, to be played to the human user.
9. A voice interaction apparatus, comprising:
a speech recognition module configured to recognize a first semantic meaning of a first voice input from a human user;
an image recognition module configured to recognize a first expression of a first facial expression image input from the human user that is associated with the first voice input;
a confidence module configured to determine, based on the first semantic meaning and the first expression, a first confidence level associated with the first semantic meaning; and
a response generation module configured to generate a first response message based on the first semantic meaning and the first confidence level.
10. The apparatus of claim 9, wherein the confidence module is configured to determine the first confidence level associated with the first semantic meaning by performing the following steps:
assigning a default confidence level to the first semantic meaning; and
adjusting the default confidence level based on the first expression, comprising:
when the first expression is a positive expression, increasing the default confidence level;
when the first expression is a negative expression, decreasing the default confidence level; and
when the first expression is a neutral expression other than the positive expression and the negative expression, keeping the default confidence level unchanged.
11. The apparatus of claim 10, wherein the confidence module is further configured to determine the first confidence level associated with the first semantic meaning by performing the following steps:
determining a semantic type of the first semantic meaning;
if the semantic type of the first semantic meaning is a question, increasing the default confidence level;
if the semantic type of the first semantic meaning is a statement or a request, determining whether the first semantic meaning contains an emotion keyword;
if the first semantic meaning does not contain an emotion keyword, performing the step of adjusting the default confidence level based on the first expression;
if the first semantic meaning contains an emotion keyword, determining whether the emotion keyword matches the first expression;
if the emotion keyword matches the first expression, increasing the default confidence level; and
if the emotion keyword does not match the first expression, performing the step of adjusting the default confidence level based on the first expression.
12. The apparatus of claim 9, wherein the response generation module is configured to generate the first response message by performing the following steps:
when the first confidence level is greater than a predetermined threshold, generating a first response message including content directly associated with the first semantic meaning; and
when the first confidence level is less than the predetermined threshold, generating a first response message asking the human user to confirm the first semantic meaning.
13. The apparatus of claim 12, wherein, when the first confidence level is less than the predetermined threshold, the first response message generated by the response generation module further includes content indirectly associated with the first semantic meaning.
14. An electronic device, comprising:
a voice receiving unit;
an image receiving unit;
a memory; and
a processor, connected with the voice receiving unit, the image receiving unit and the memory via a bus system, the processor being configured to run instructions stored on the memory to perform the method of any one of claims 1-8.
15. A computer program product, comprising computer program instructions that, when run by a processor, cause the processor to perform the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610806384.5A CN106373569B (en) | 2016-09-06 | 2016-09-06 | Voice interaction device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106373569A true CN106373569A (en) | 2017-02-01 |
CN106373569B CN106373569B (en) | 2019-12-20 |
Family
ID=57900064
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610806384.5A Active CN106373569B (en) | 2016-09-06 | 2016-09-06 | Voice interaction device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106373569B (en) |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106910514A (en) * | 2017-04-30 | 2017-06-30 | 上海爱优威软件开发有限公司 | Method of speech processing and system |
CN107199572A (en) * | 2017-06-16 | 2017-09-26 | 山东大学 | A kind of robot system and method based on intelligent auditory localization and Voice command |
CN107240398A (en) * | 2017-07-04 | 2017-10-10 | 科大讯飞股份有限公司 | Intelligent sound exchange method and device |
CN108320738A (en) * | 2017-12-18 | 2018-07-24 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium, electronic equipment |
CN108564943A (en) * | 2018-04-27 | 2018-09-21 | 京东方科技集团股份有限公司 | voice interactive method and system |
CN108833941A (en) * | 2018-06-29 | 2018-11-16 | 北京百度网讯科技有限公司 | Man-machine dialogue system method, apparatus, user terminal, processing server and system |
CN108833721A (en) * | 2018-05-08 | 2018-11-16 | 广东小天才科技有限公司 | Emotion analysis method based on call, user terminal and system |
CN109005304A (en) * | 2017-06-07 | 2018-12-14 | 中兴通讯股份有限公司 | A kind of queuing strategy and device, computer readable storage medium |
CN109240488A (en) * | 2018-07-27 | 2019-01-18 | 重庆柚瓣家科技有限公司 | A kind of implementation method of AI scene engine of positioning |
CN109741738A (en) * | 2018-12-10 | 2019-05-10 | 平安科技(深圳)有限公司 | Sound control method, device, computer equipment and storage medium |
CN109783669A (en) * | 2019-01-21 | 2019-05-21 | 美的集团武汉制冷设备有限公司 | Screen methods of exhibiting, robot and computer readable storage medium |
CN109979462A (en) * | 2019-03-21 | 2019-07-05 | 广东小天才科技有限公司 | Method and system for obtaining intention by combining context |
WO2019200584A1 (en) * | 2018-04-19 | 2019-10-24 | Microsoft Technology Licensing, Llc | Generating response in conversation |
CN110491383A (en) * | 2019-09-25 | 2019-11-22 | 北京声智科技有限公司 | A kind of voice interactive method, device, system, storage medium and processor |
CN110546630A (en) * | 2017-03-31 | 2019-12-06 | 三星电子株式会社 | Method for providing information and electronic device supporting the same |
CN110931006A (en) * | 2019-11-26 | 2020-03-27 | 深圳壹账通智能科技有限公司 | Intelligent question-answering method based on emotion analysis and related equipment |
CN111210818A (en) * | 2019-12-31 | 2020-05-29 | 北京三快在线科技有限公司 | Word acquisition method and device matched with emotion polarity and electronic equipment |
WO2020119569A1 (en) * | 2018-12-11 | 2020-06-18 | 阿里巴巴集团控股有限公司 | Voice interaction method, device and system |
CN111428017A (en) * | 2020-03-24 | 2020-07-17 | 科大讯飞股份有限公司 | Human-computer interaction optimization method and related device |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111883127A (en) * | 2020-07-29 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech |
CN112106381A (en) * | 2018-05-17 | 2020-12-18 | 高通股份有限公司 | User experience assessment |
CN112235180A (en) * | 2020-08-29 | 2021-01-15 | 上海量明科技发展有限公司 | Voice message processing method and device and instant messaging client |
CN112307816A (en) * | 2019-07-29 | 2021-02-02 | 北京地平线机器人技术研发有限公司 | In-vehicle image acquisition method and device, electronic equipment and storage medium |
CN112687260A (en) * | 2020-11-17 | 2021-04-20 | 珠海格力电器股份有限公司 | Facial-recognition-based expression judgment voice recognition method, server and air conditioner |
CN112804440A (en) * | 2019-11-13 | 2021-05-14 | 北京小米移动软件有限公司 | Method, device and medium for processing image |
CN113435338A (en) * | 2021-06-28 | 2021-09-24 | 平安科技(深圳)有限公司 | Voting classification method and device, electronic equipment and readable storage medium |
CN113823282A (en) * | 2019-06-26 | 2021-12-21 | 百度在线网络技术(北京)有限公司 | Voice processing method, system and device |
CN114842842A (en) * | 2022-03-25 | 2022-08-02 | 青岛海尔科技有限公司 | Voice interaction method and device of intelligent equipment and storage medium |
CN115497474A (en) * | 2022-09-13 | 2022-12-20 | 广东浩博特科技股份有限公司 | Control method based on voice recognition |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101423258B1 (en) * | 2012-11-27 | 2014-07-24 | 포항공과대학교 산학협력단 | Method for supplying consulting communication and apparatus using the method |
CN104038836A (en) * | 2014-06-03 | 2014-09-10 | 四川长虹电器股份有限公司 | Television program intelligent pushing method |
CN105389309A (en) * | 2014-09-03 | 2016-03-09 | 曲阜师范大学 | Music regulation system driven by emotional semantic recognition based on cloud fusion |
CN105244023A (en) * | 2015-11-09 | 2016-01-13 | 上海语知义信息技术有限公司 | System and method for reminding teacher emotion in classroom teaching |
CN105334743A (en) * | 2015-11-18 | 2016-02-17 | 深圳创维-Rgb电子有限公司 | Intelligent home control method and system based on emotion recognition |
CN105895101A (en) * | 2016-06-08 | 2016-08-24 | 国网上海市电力公司 | Speech processing equipment and processing method for power intelligent auxiliary service system |
Cited By (45)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110546630B (en) * | 2017-03-31 | 2023-12-05 | 三星电子株式会社 | Method for providing information and electronic device supporting the same |
CN110546630A (en) * | 2017-03-31 | 2019-12-06 | 三星电子株式会社 | Method for providing information and electronic device supporting the same |
CN106910514A (en) * | 2017-04-30 | 2017-06-30 | 上海爱优威软件开发有限公司 | Method of speech processing and system |
CN109005304A (en) * | 2017-06-07 | 2018-12-14 | 中兴通讯股份有限公司 | A kind of queuing strategy and device, computer readable storage medium |
CN107199572A (en) * | 2017-06-16 | 2017-09-26 | 山东大学 | A kind of robot system and method based on intelligent auditory localization and Voice command |
CN107199572B (en) * | 2017-06-16 | 2020-02-14 | 山东大学 | Robot system and method based on intelligent sound source positioning and voice control |
CN107240398A (en) * | 2017-07-04 | 2017-10-10 | 科大讯飞股份有限公司 | Intelligent sound exchange method and device |
CN107240398B (en) * | 2017-07-04 | 2020-11-17 | 科大讯飞股份有限公司 | Intelligent voice interaction method and device |
CN108320738B (en) * | 2017-12-18 | 2021-03-02 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium and electronic equipment |
CN108320738A (en) * | 2017-12-18 | 2018-07-24 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium, electronic equipment |
WO2019200584A1 (en) * | 2018-04-19 | 2019-10-24 | Microsoft Technology Licensing, Llc | Generating response in conversation |
CN110998725A (en) * | 2018-04-19 | 2020-04-10 | 微软技术许可有限责任公司 | Generating responses in a conversation |
CN110998725B (en) * | 2018-04-19 | 2024-04-12 | 微软技术许可有限责任公司 | Generating a response in a dialog |
US11922934B2 (en) | 2018-04-19 | 2024-03-05 | Microsoft Technology Licensing, Llc | Generating response in conversation |
CN108564943A (en) * | 2018-04-27 | 2018-09-21 | 京东方科技集团股份有限公司 | voice interactive method and system |
CN108833721B (en) * | 2018-05-08 | 2021-03-12 | 广东小天才科技有限公司 | Emotion analysis method based on call, user terminal and system |
CN108833721A (en) * | 2018-05-08 | 2018-11-16 | 广东小天才科技有限公司 | Emotion analysis method based on call, user terminal and system |
CN112106381B (en) * | 2018-05-17 | 2023-12-01 | 高通股份有限公司 | User experience assessment method, device and equipment |
CN112106381A (en) * | 2018-05-17 | 2020-12-18 | 高通股份有限公司 | User experience assessment |
US11282516B2 (en) | 2018-06-29 | 2022-03-22 | Beijing Baidu Netcom Science Technology Co., Ltd. | Human-machine interaction processing method and apparatus thereof |
CN108833941A (en) * | 2018-06-29 | 2018-11-16 | 北京百度网讯科技有限公司 | Man-machine dialogue system method, apparatus, user terminal, processing server and system |
CN109240488A (en) * | 2018-07-27 | 2019-01-18 | 重庆柚瓣家科技有限公司 | A kind of implementation method of AI scene engine of positioning |
CN109741738A (en) * | 2018-12-10 | 2019-05-10 | 平安科技(深圳)有限公司 | Sound control method, device, computer equipment and storage medium |
WO2020119569A1 (en) * | 2018-12-11 | 2020-06-18 | 阿里巴巴集团控股有限公司 | Voice interaction method, device and system |
CN109783669A (en) * | 2019-01-21 | 2019-05-21 | 美的集团武汉制冷设备有限公司 | Screen methods of exhibiting, robot and computer readable storage medium |
CN109979462A (en) * | 2019-03-21 | 2019-07-05 | 广东小天才科技有限公司 | Method and system for obtaining intention by combining context |
CN113823282A (en) * | 2019-06-26 | 2021-12-21 | 百度在线网络技术(北京)有限公司 | Voice processing method, system and device |
CN112307816A (en) * | 2019-07-29 | 2021-02-02 | 北京地平线机器人技术研发有限公司 | In-vehicle image acquisition method and device, electronic equipment and storage medium |
CN110491383B (en) * | 2019-09-25 | 2022-02-18 | 北京声智科技有限公司 | Voice interaction method, device and system, storage medium and processor |
CN110491383A (en) * | 2019-09-25 | 2019-11-22 | 北京声智科技有限公司 | A kind of voice interactive method, device, system, storage medium and processor |
CN112804440A (en) * | 2019-11-13 | 2021-05-14 | 北京小米移动软件有限公司 | Method, device and medium for processing image |
CN110931006A (en) * | 2019-11-26 | 2020-03-27 | 深圳壹账通智能科技有限公司 | Intelligent question-answering method based on emotion analysis and related equipment |
WO2021135140A1 (en) * | 2019-12-31 | 2021-07-08 | 北京三快在线科技有限公司 | Word collection method matching emotion polarity |
CN111210818A (en) * | 2019-12-31 | 2020-05-29 | 北京三快在线科技有限公司 | Word acquisition method and device matched with emotion polarity and electronic equipment |
CN111428017A (en) * | 2020-03-24 | 2020-07-17 | 科大讯飞股份有限公司 | Human-computer interaction optimization method and related device |
CN111428017B (en) * | 2020-03-24 | 2022-12-02 | 科大讯飞股份有限公司 | Human-computer interaction optimization method and related device |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111710326B (en) * | 2020-06-12 | 2024-01-23 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111883127A (en) * | 2020-07-29 | 2020-11-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for processing speech |
CN112235180A (en) * | 2020-08-29 | 2021-01-15 | 上海量明科技发展有限公司 | Voice message processing method and device and instant messaging client |
CN112687260A (en) * | 2020-11-17 | 2021-04-20 | 珠海格力电器股份有限公司 | Facial-recognition-based expression judgment voice recognition method, server and air conditioner |
CN113435338A (en) * | 2021-06-28 | 2021-09-24 | 平安科技(深圳)有限公司 | Voting classification method and device, electronic equipment and readable storage medium |
CN113435338B (en) * | 2021-06-28 | 2024-07-19 | 平安科技(深圳)有限公司 | Voting classification method, voting classification device, electronic equipment and readable storage medium |
CN114842842A (en) * | 2022-03-25 | 2022-08-02 | 青岛海尔科技有限公司 | Voice interaction method and device of intelligent equipment and storage medium |
CN115497474A (en) * | 2022-09-13 | 2022-12-20 | 广东浩博特科技股份有限公司 | Control method based on voice recognition |
Also Published As
Publication number | Publication date |
---|---|
CN106373569B (en) | 2019-12-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106373569A (en) | Voice interaction apparatus and method | |
US11270695B2 (en) | Augmentation of key phrase user recognition | |
CN108701453B (en) | Modular deep learning model | |
US11715485B2 (en) | Artificial intelligence apparatus for converting text and speech in consideration of style and method for the same | |
US11488576B2 (en) | Artificial intelligence apparatus for generating text or speech having content-based style and method for the same | |
CN107481720B (en) | Explicit voiceprint recognition method and device | |
CN116547746A (en) | Dialog management for multiple users | |
US20190341058A1 (en) | Joint neural network for speaker recognition | |
KR100586767B1 (en) | System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input | |
Schuller et al. | Audiovisual recognition of spontaneous interest within conversations | |
CN112074901A (en) | Speech recognition login | |
WO2018048549A1 (en) | Method and system of automatic speech recognition using posterior confidence scores | |
CN111898670B (en) | Multi-mode emotion recognition method, device, equipment and storage medium | |
KR20200113105A (en) | Electronic device providing a response and method of operating the same | |
KR20210155401A (en) | Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof | |
EP3841460B1 (en) | Electronic device and method for controlling the same | |
KR20200027331A (en) | Voice synthesis device | |
Teye et al. | Evaluation of conversational agents: understanding culture, context and environment in emotion detection | |
US20220375469A1 (en) | Intelligent voice recognition method and apparatus | |
US20220417047A1 (en) | Machine-learning-model based name pronunciation | |
KR20190093962A (en) | Speech signal processing mehtod for speaker recognition and electric apparatus thereof | |
US20210337274A1 (en) | Artificial intelligence apparatus and method for providing visual information | |
Jiang et al. | Target Speech Diarization with Multimodal Prompts | |
KR20230120790A (en) | Speech Recognition Healthcare Service Using Variable Language Model | |
JP6114210B2 (en) | Speech recognition apparatus, feature quantity conversion matrix generation apparatus, speech recognition method, feature quantity conversion matrix generation method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||