CN110085211A - Speech recognition interaction method, apparatus, computer device and storage medium - Google Patents
Speech recognition interaction method, apparatus, computer device and storage medium
- Publication number
- CN110085211A (application number CN201810079431.XA)
- Authority
- CN
- China
- Prior art keywords
- emotion recognition
- recognition result
- confidence level
- text
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0638—Interactive procedures
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Signal Processing (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the invention provide a speech recognition interaction method, apparatus, computer device and storage medium, solving the problem that intelligent interaction in the prior art can neither analyze the deeper intention of a user voice message nor provide a more humanized interactive experience. The speech recognition interaction method includes: obtaining an emotion recognition result from a user voice message, wherein the emotion recognition result includes at least an audio emotion recognition result, or includes at least both an audio emotion recognition result and a text emotion recognition result; performing intention analysis on the text content of the user voice message to obtain corresponding basic intent information; and determining a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
Description
Technical field
The present invention relates to the technical field of intelligent interaction, and in particular to a speech recognition interaction method, apparatus, computer device and storage medium.
Background art
With the continuous development of artificial intelligence technology and users' rising expectations for interactive experience, intelligent interaction has gradually begun to replace traditional human-computer interaction and has become a research hotspot. However, existing intelligent interaction can only roughly analyze the semantic content of a user voice message; it cannot identify the user's current emotional state from the voice, and therefore can neither infer the deeper emotional demand the user voice message actually intends to express nor provide a more humanized interactive experience according to that state. For example, a user hurrying to catch a flight in an anxious emotional state and a user calmly just beginning to plan a trip certainly expect different styles of reply when asking about flight times by voice; yet under existing semantics-only intelligent interaction, both users receive the same reply, such as the same programmatic recitation of flight time information.
Summary of the invention
In view of this, embodiments of the invention provide a speech recognition interaction method, apparatus, computer device and storage medium, solving the problem that intelligent interaction in the prior art can neither analyze the deeper intention of a user voice message nor provide a more humanized interactive experience.
A speech recognition interaction method provided by an embodiment of the invention includes:
obtaining an emotion recognition result from a user voice message, wherein the emotion recognition result includes at least an audio emotion recognition result, or includes at least both an audio emotion recognition result and a text emotion recognition result;
performing intention analysis on the text content of the user voice message to obtain corresponding basic intent information; and
determining a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
A speech recognition interaction apparatus provided by an embodiment of the invention includes:
an emotion recognition module, configured to obtain an emotion recognition result from a user voice message, wherein the emotion recognition result includes at least an audio emotion recognition result, or includes at least both an audio emotion recognition result and a text emotion recognition result;
a basic intention recognition module, configured to perform intention analysis on the text content of the user voice message to obtain corresponding basic intent information; and
an interactive instruction determining module, configured to determine a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
A computer device provided by an embodiment of the invention includes a memory, a processor, and a computer program stored on the memory to be executed by the processor, the processor implementing the steps of the method described above when executing the computer program.
A computer-readable storage medium provided by an embodiment of the invention has a computer program stored thereon, the computer program implementing the steps of the method described above when executed by a processor.
The speech recognition interaction method, apparatus, computer device and computer-readable storage medium provided by embodiments of the invention, on the basis of understanding the basic intent information of a user voice message, combine the emotion recognition result obtained from that message, and further provide an interactive instruction carrying emotion according to the basic intent information and the emotion recognition result, thereby solving the problem that intelligent interaction in the prior art can neither analyze the deeper intention of a user voice message nor provide a more humanized interactive experience.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a speech recognition interaction method provided by an embodiment of the invention.
Fig. 2 is a schematic flowchart of determining the emotion recognition result in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 3 is a schematic flowchart of determining the emotion recognition result in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 4 is a schematic flowchart of determining the emotion recognition result in a speech recognition interaction method provided by another embodiment of the invention.
Fig. 5 is a schematic flowchart of obtaining the text emotion recognition result from the text content of the user voice message in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 6 is a schematic flowchart of obtaining the text emotion recognition result from the text content of the user voice message in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 7 is a schematic flowchart of determining the text emotion recognition result in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 8 is a schematic flowchart of determining the text emotion recognition result in a speech recognition interaction method provided by another embodiment of the invention.
Fig. 9 is a schematic flowchart of obtaining the basic intent information from the user voice message in a speech recognition interaction method provided by an embodiment of the invention.
Fig. 10 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Fig. 11 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Fig. 12 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Fig. 13 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Fig. 14 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Fig. 15 is a schematic structural diagram of a speech recognition interaction apparatus provided by an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the protection scope of the invention.
Fig. 1 is a schematic flowchart of a speech recognition interaction method provided by an embodiment of the invention. As shown in Fig. 1, the speech recognition interaction method includes the following steps:
Step 101: obtain an emotion recognition result from the user voice message.
The emotion recognition result includes at least an audio emotion recognition result, or includes at least both an audio emotion recognition result and a text emotion recognition result.
A user voice message refers to voice information, input by the user during interaction, that relates to the user's interaction intention and demand. For example, in a call-center customer service scenario, the user voice message may be speech uttered by the user, who may be either the client or the service side; in an intelligent robot interaction scenario, the user voice message may be voice information entered through the robot's input module. The present invention places no limitation on the specific source of the user voice message.
Since the audio data of user voice messages in different emotional states contains different audio features, an audio emotion recognition result can be obtained from the audio data of the user voice message, and the emotion recognition result can then be determined according to the audio emotion recognition result.
In the subsequent process, the emotion recognition result obtained from the user voice message is combined with the basic intent information to infer the user's mood intention, or an interactive instruction carrying emotion is provided directly according to the basic intent information and the emotion recognition result.
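By way of illustration only (not the patented method), the audio side of step 101 can be sketched as extracting two classic prosodic features, short-time energy and zero-crossing rate, which an audio emotion classifier could consume. The function name, frame length and synthetic signals below are all assumptions.

```python
import math

def frame_features(signal, frame_len=256):
    """Split a mono waveform into fixed frames and compute, per frame,
    short-time energy and zero-crossing rate -- two prosodic features
    commonly fed to an audio emotion classifier."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / frame_len
        feats.append((energy, zcr))
    return feats

# Synthetic proxies: a loud, fast-varying signal (agitated speech)
# versus a quiet, slowly varying one (calm speech).
agitated = [math.sin(0.9 * i) for i in range(1024)]
calm = [0.05 * math.sin(0.02 * i) for i in range(1024)]

def mean_energy(feats):
    return sum(e for e, _ in feats) / len(feats)
```

A real system would of course use richer spectral features and a trained model; the point is only that emotional state leaves measurable traces in the audio data.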
Step 102: perform intention analysis on the text content of the user voice message to obtain corresponding basic intent information.
The basic intent information corresponds to the intention the user voice message intuitively reflects, but it cannot reflect the user's true emotional demand in the current state; the emotion recognition result must therefore be combined with it to determine the deeper intention and emotional need the user voice message actually expresses. For example, consider a user hurrying to catch a flight in an anxious emotional state and a user calmly just beginning to plan a trip: when the content of both users' voice messages is an inquiry about flight information, the basic intent information obtained is identical, namely a flight information query, yet the emotional needs of the two are obviously different.
It should be appreciated that the basic intent information can be obtained by intention analysis of the text content of the user voice message; it corresponds to the intention reflected at the semantic level of that text content and carries no emotion.
In an embodiment of the present invention, to further improve the accuracy of the obtained basic intent information, the intention analysis may also combine the current user voice message with past user voice messages and/or subsequent user voice messages. For example, the intention of the current user voice message may lack certain keywords or slots, which can be recovered from past and/or subsequent user voice messages. For instance, when the content of the current user voice message is "What specialty does it have?", the subject slot is missing, but by combining the past user voice message "How is the weather in Changzhou?", "Changzhou" can be extracted as the subject, so that the basic intent information finally obtained for the current user voice message becomes "What specialty does Changzhou have?".
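The Changzhou example can be sketched as a toy slot-filling routine over the dialogue history. The trivial keyword matcher, the intent labels and the slot name are illustrative assumptions; the patent does not prescribe an implementation.

```python
def parse(message):
    """Toy intent parser: derive (intent, slots) from keywords."""
    slots = {}
    for city in ("Changzhou", "Beijing"):
        if city in message:
            slots["subject"] = city
    if "specialty" in message:
        return "ask_specialty", slots
    if "weather" in message:
        return "ask_weather", slots
    return "unknown", slots

def resolve_intent(current, history):
    """Fill slots missing from the current message using earlier turns,
    most recent turn first."""
    intent, slots = parse(current)
    for past in reversed(history):
        _, past_slots = parse(past)
        for key, value in past_slots.items():
            slots.setdefault(key, value)  # only fill gaps, never overwrite
    return intent, slots

history = ["How is the weather in Changzhou?"]
intent, slots = resolve_intent("What specialty does it have?", history)
# the missing subject slot is inherited from the previous turn
```

The same backward scan could be extended forward over subsequent messages, matching the "past and/or subsequent" wording above.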
Step 103: determine a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
The correspondence between the emotion recognition result and basic intent information on one side and the interactive instruction on the other can be established through a pre-learning process. In an embodiment of the present invention, the content of the interactive instruction may include one or more emotion presentation modes: text output emotion presentation, melody playing emotion presentation, speech emotion presentation, image emotion presentation and mechanical action emotion presentation. It should be appreciated, however, that the specific emotion presentation modes of the interactive instruction can be adjusted to the needs of the interaction scenario; the present invention places no limitation on the specific content and form of the interactive instruction.
In an embodiment of the present invention, corresponding mood intent information may first be determined according to the emotion recognition result and the basic intent information, and the corresponding interactive instruction then determined from the mood intent information, or from the mood intent information together with the basic intent information. The mood intent information here may have specific content, or may exist merely as an identifier of the mapping relation. The correspondence between mood intent information and interactive instructions, and between basic intent information and interactive instructions, can likewise be pre-established through a pre-learning process.
Specifically, mood intent information refers to intent information carrying emotion: while reflecting the basic intention, it simultaneously reflects the emotional demand of the user voice message, and its correspondence with the emotion recognition result and basic intent information can be pre-established through a pre-learning process. In an embodiment of the present invention, the mood intent information may include emotional need information corresponding to the emotion recognition result, or may include both such emotional need information and the association relation between the emotion recognition result and the basic intent information. The association relation between the emotion recognition result and the basic intent information is preset. For example, when the content of the emotion recognition result is "anxiety" and the content of the basic intent information is "report the lost credit card", the determined mood intent information may include the association relation between the two: "report the lost credit card; the user is very anxious, the credit card may be lost or stolen", while the determined emotional need information may be "comfort".
The association relation between the emotional state of the emotion recognition result and the basic intent information may be preset (for example through rule settings or logic judgments); it may also be based on a specific trained model (for example an end-to-end model that takes the emotional state, scene information and the like as input and directly outputs the emotion intention). Such a model may be a fixed deep network model rather than a set rule, and may also be continuously updated through online learning (for example, using a reinforcement learning model with an objective function and a reward function set in the model, the model continuously evolves as the number of human-computer interactions grows).
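The rule-settings variant of this preset correspondence can be sketched as a lookup table keyed by the pair (emotion recognition result, basic intent information). The first entry mirrors the "anxiety" / "report the lost credit card" example above; every other entry and name is an assumption.

```python
# Preset association table: (emotion, basic intent) -> mood intent info.
RULES = {
    ("anxiety", "report the lost credit card"): {
        "mood_intent": "report the lost credit card; the user is very "
                       "anxious, the credit card may be lost or stolen",
        "emotional_need": "comfort",
    },
    ("calm", "query flight times"): {
        "mood_intent": "query flight times; routine request",
        "emotional_need": "none",
    },
}

def mood_intent(emotion, basic_intent):
    """Look up the mood intent information; fall back to the bare
    basic intent with no emotional need when no rule matches."""
    default = {"mood_intent": basic_intent, "emotional_need": "none"}
    return RULES.get((emotion, basic_intent), default)

info = mood_intent("anxiety", "report the lost credit card")
```

A trained end-to-end model, as the text notes, would replace this table with a learned function of the same signature.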
It should be appreciated that in some application scenarios the feedback content of the mood intent information needs to be displayed. For example, in certain customer service interaction scenarios, the mood intent information analyzed from the client's voice content needs to be presented to the customer service staff as a reminder; in that case the corresponding mood intent information must be determined and its feedback content displayed. In other application scenarios, however, a corresponding interactive instruction is given directly and the feedback content of the mood intent information need not be displayed; the corresponding interactive instruction can then be determined directly from the emotion recognition result and the basic intent information, without generating mood intent information.
In an embodiment of the present invention, to further improve the accuracy of the obtained mood intent information, the corresponding mood intent information may also be determined from the emotion recognition result and basic intent information of the current user voice message, combined with the emotion recognition results and basic intent information of past and/or subsequent user voice messages. In that case the emotion recognition result and basic intent information of the current user voice message need to be recorded in real time, so as to serve as a reference when determining mood intent information for other user voice messages.
In an embodiment of the present invention, to further improve the accuracy of the obtained interactive instruction, the corresponding interactive instruction may likewise be determined from the mood intent information and basic intent information of the current user voice message, combined with the mood intent information and basic intent information of past and/or subsequent user voice messages. Here too the emotion recognition result and basic intent information of the current user voice message need to be recorded in real time, to serve as a reference when determining interactive instructions for other user voice messages.
It can thus be seen that the speech recognition interaction method provided by embodiments of the present invention, on the basis of understanding the user's basic intent information, combines the emotion recognition result obtained from the user voice message to further infer the user's mood intention, or directly provides an interactive instruction carrying emotion according to the basic intent information and the emotion recognition result, thereby solving the problem that intelligent interaction in the prior art can neither analyze the deeper intention and emotional need of a user voice message nor provide a more humanized interactive experience.
In an embodiment of the present invention, the emotion recognition result may be determined comprehensively from the audio emotion recognition result and the text emotion recognition result. Specifically, the audio emotion recognition result is obtained from the audio data of the user voice message, the text emotion recognition result is obtained from the text content of the user voice message, and the emotion recognition result is then determined comprehensively from the two. As stated above, however, the final emotion recognition result may also be determined from the audio emotion recognition result alone; the present invention places no limitation on this.
It should be appreciated that the audio emotion recognition result and the text emotion recognition result may be characterized in several ways. In an embodiment of the present invention, emotion recognition results may be characterized as discrete emotion categories; in this case the audio emotion recognition result and the text emotion recognition result may each include one or more of a plurality of emotion categories. For example, in a customer service interaction scenario, the plurality of emotion categories may include: a satisfied category, a calm category, and an irritated category, corresponding to the emotional states a user is likely to exhibit in such a scenario; alternatively, the plurality of emotion categories may include: a satisfied category, a calm category, an irritated category, and an angry category, corresponding to the emotional states customer service staff are likely to exhibit. It should be appreciated, however, that the type and number of emotion categories may be adjusted according to the demands of the actual application scenario; the present invention likewise places no strict limitation on the type and number of emotion categories. In a further embodiment, each emotion category may also include multiple emotion intensity levels. Specifically, emotion category and emotion intensity level may be regarded as two dimensional parameters, either independent of each other (for example, every emotion category has the same N emotion intensity levels, such as slight, moderate, and severe) or bound by a preset correspondence (for example, the "irritated" category includes three emotion intensity levels, slight, moderate, and severe, while the "satisfied" category includes only two, moderate and severe). In the latter case the emotion intensity level can be regarded as an attribute of the emotion category: once an emotion category is determined by the emotion recognition process, its emotion intensity level is determined as well.
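The category/intensity representation with a preset correspondence can be sketched as follows. This is a minimal illustration only; the dictionary layout and all names are our assumptions, not the patent's actual data structures.

```python
# Hypothetical sketch: discrete emotion categories whose allowed emotion
# intensity levels are bound by a preset correspondence, as described above.
ALLOWED_INTENSITIES = {
    "satisfied": ["moderate", "severe"],           # only two levels
    "calm": ["slight", "moderate", "severe"],
    "irritated": ["slight", "moderate", "severe"],  # three levels
}

def make_emotion_result(category, intensity):
    """Build a recognition result, validating intensity against the category."""
    if intensity not in ALLOWED_INTENSITIES[category]:
        raise ValueError(f"{intensity!r} not defined for {category!r}")
    return {"category": category, "intensity": intensity}
```

Under this correspondence, requesting a "slight satisfied" result is rejected, since the satisfied category defines only moderate and severe levels.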
In another embodiment of the present invention, emotion recognition results may also be characterized using a non-discrete dimensional emotion model. In this case the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space, where each dimension corresponds to a psychologically defined emotional factor. For example, the PAD (Pleasure-Arousal-Dominance) three-dimensional emotion model may be used. This model holds that emotion has three dimensions, pleasure, arousal, and dominance, and that every emotion can be characterized by the emotional factors corresponding to these three dimensions. P represents pleasure, indicating the positive or negative character of the individual's emotional state; A represents arousal, indicating the individual's level of neurophysiological activation; D represents dominance, indicating the individual's state of control over the scene and over others.
It should be appreciated that the audio emotion recognition result and the text emotion recognition result may also be characterized in other ways; the present invention places no limitation on the specific characterization.
Fig. 2 is a flow diagram of determining the emotion recognition result in the speech recognition interaction method provided by an embodiment of the present invention. The user message in this embodiment includes at least a user speech message, the emotion recognition result is determined jointly from the audio emotion recognition result and the text emotion recognition result, and the audio and text emotion recognition results each include one or more of a plurality of emotion categories. The method of determining the emotion recognition result may include the following steps:
Step 201: if the audio emotion recognition result and the text emotion recognition result include an identical emotion category, take that identical emotion category as the emotion recognition result.
For example, when the audio emotion recognition result includes the satisfied category and the calm category, and the text emotion recognition result includes only the satisfied category, the final emotion recognition result may be the satisfied category.
Step 202: if the audio emotion recognition result and the text emotion recognition result include no identical emotion category, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
For example, when the audio emotion recognition result includes only the satisfied category and the text emotion recognition result includes only the calm category, the final emotion recognition result may be the satisfied category together with the calm category. In an embodiment of the present invention, when the final emotion recognition result includes multiple emotion categories, the emotion recognition results and basic intent information of past and/or subsequent user speech messages are also taken into account in the subsequent process in order to determine the corresponding emotion intent information.
It should be appreciated that although step 202 specifies that, when the audio emotion recognition result and the text emotion recognition result include no identical emotion category, both are taken together as the emotion recognition result, other embodiments of the present invention may adopt a more conservative interaction strategy, for example directly generating an error message or outputting no emotion recognition result at all, so as to avoid misleading the interaction process. The present invention places no strict limitation on how the case of no identical emotion category is handled.
Fig. 3 is a flow diagram of determining the emotion recognition result in the speech recognition interaction method provided by an embodiment of the present invention. The user message in this embodiment likewise includes at least a user speech message, the emotion recognition result is again determined jointly from the audio emotion recognition result and the text emotion recognition result, and the audio and text emotion recognition results each include one or more of a plurality of emotion categories. The method of determining the emotion recognition result may include the following steps:
Step 301: compute the confidence of each emotion category in the audio emotion recognition result and the confidence of each emotion category in the text emotion recognition result.
In statistics, confidence is also referred to as reliability, confidence level, or confidence coefficient. Because samples are random, a conclusion drawn by estimating a population parameter from a sample is always uncertain. Interval estimation in mathematical statistics can therefore be used to estimate the probability that the error between an estimate and the population parameter lies within a given allowable range; this probability is the confidence. For example, suppose a preset emotion category is associated with a variable characterizing that category, i.e., different values of the variable map to different emotion categories. To obtain the confidence of a speech emotion recognition result, multiple rounds of audio/text emotion recognition are first run to obtain multiple measurements of the variable, and the mean of these measurements is taken as an estimate. Interval estimation then gives the probability that the error between this estimate and the true value of the variable lies within a certain range; the larger this probability, the more accurate the estimate, i.e., the higher the confidence of the current emotion category. It should be appreciated that the variable characterizing an emotion category depends on the specific emotion recognition algorithm, which the present invention does not limit.
Step 302: judge whether the emotion category with the highest confidence in the audio emotion recognition result is identical to the emotion category with the highest confidence in the text emotion recognition result. If so, execute step 303; otherwise execute step 304.
Step 303: take the emotion category with the highest confidence in the audio emotion recognition result (equivalently, in the text emotion recognition result) as the emotion recognition result.
Reaching this step means the highest-confidence emotion category is the same in the audio and text emotion recognition results, so that category can serve directly as the final emotion recognition result. For example, when the audio emotion recognition result includes the satisfied category (confidence a1) and the calm category (confidence a2), the text emotion recognition result includes only the satisfied category (confidence b1), and a1 > a2, the satisfied category is taken as the final emotion recognition result.
Step 304: compare the confidence of the highest-confidence emotion category in the audio emotion recognition result with the confidence of the highest-confidence emotion category in the text emotion recognition result.
In an embodiment of the present invention, considering the limitations of the specific emotion recognition algorithm and of the type and content of user speech messages in a real application scenario, one of the audio emotion recognition result and the text emotion recognition result may be selected as the primary emotion recognition output and the other as the auxiliary output, with factors such as confidence and emotion intensity level then used to determine the final emotion recognition result. It should be appreciated that which of the two is selected as the primary output may depend on the actual scenario. For example, when the audio quality of the user speech message is poor (such as a low sample rate or heavy noise, making audio features hard to extract) but the text conversion condition is satisfied, the text emotion recognition result may be selected as the primary output and the audio emotion recognition result as the auxiliary output. The present invention, however, places no limitation on which of the two is selected as the primary output.
In an embodiment of the present invention, the audio emotion recognition result is taken as the primary output and the text emotion recognition result as the auxiliary output. In this case, if the confidence of the highest-confidence emotion category in the audio emotion recognition result is greater than that in the text emotion recognition result, execute step 305; if it is less, execute step 306; if the two are equal, execute step 309.
Step 305: take the emotion category with the highest confidence in the audio emotion recognition result as the emotion recognition result.
Since the audio emotion recognition result has been selected as the primary output, its emotion categories should be considered first; and here the confidence of its highest-confidence category also exceeds that of the text emotion recognition result's highest-confidence category, so the highest-confidence category of the primary (audio) result can be taken as the emotion recognition result. For example, when the audio emotion recognition result includes the satisfied category (confidence a1) and the calm category (confidence a2), the text emotion recognition result includes only the calm category (confidence b1), a1 > a2, and a1 > b1, the satisfied category is taken as the final emotion recognition result.
Step 306: judge whether the audio emotion recognition result includes the emotion category with the highest confidence in the text emotion recognition result. If so, execute step 307; otherwise execute step 309.
When the confidence of the highest-confidence emotion category in the audio emotion recognition result is less than that in the text emotion recognition result, the text result's highest-confidence category may be the more credible one; but since the audio emotion recognition result was selected as the primary output, it must first be checked whether the audio emotion recognition result also includes that category. If it does, the emotion intensity level can then be used to decide whether the text result's highest-confidence category should be adopted as the auxiliary consideration. For example, when the audio emotion recognition result includes the satisfied category (confidence a1) and the calm category (confidence a2), the text emotion recognition result includes only the calm category (confidence b1), a1 > a2, and a1 < b1, it must be judged whether the audio emotion recognition result includes the calm category, which has the highest confidence in the text emotion recognition result.
Step 307: further judge whether the emotion intensity level, within the audio emotion recognition result, of the text emotion recognition result's highest-confidence emotion category is greater than a first intensity threshold. If so, execute step 308; otherwise execute step 309.
Step 308: take the emotion category with the highest confidence in the text emotion recognition result as the emotion recognition result.
Reaching step 308 means the audio emotion recognition result does include the text result's highest-confidence emotion category, and that category's emotion intensity level is sufficiently high. The text result's highest-confidence category is thus not only highly credible but also shows a clear emotional tendency, so it can be taken as the emotion recognition result. For example, when the audio emotion recognition result includes the satisfied category (confidence a1) and the calm category (confidence a2), the text emotion recognition result includes only the calm category (confidence b1), a1 > a2, a1 < b1, and the emotion intensity level of the calm category is greater than the first intensity threshold, the calm category is taken as the final emotion recognition result.
Step 309: take the emotion category with the highest confidence in the audio emotion recognition result as the emotion recognition result, or take the highest-confidence emotion categories of the audio and text emotion recognition results together as the emotion recognition result.
Step 309 is reached when the confidences of the highest-confidence emotion categories in the audio and text emotion recognition results are equal, when the audio emotion recognition result does not include the text result's highest-confidence category, or when it does include that category but the category's emotion intensity level is not high enough. In all of these cases the audio and text emotion recognition results cannot yield a single unified emotion category as the final emotion recognition result. At this point, in an embodiment of the present invention, since the audio emotion recognition result was selected as the primary output, its highest-confidence emotion category may be taken directly as the emotion recognition result. In another embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result may instead be taken together as the emotion recognition result, with the emotion recognition results and basic intent information of past and/or subsequent user speech messages combined in the subsequent process to determine the corresponding emotion intent information.
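The decision flow of steps 301-309 (with the audio result as the primary output) can be sketched as follows. The data layout, the threshold value, and the choice of the single-category variant of step 309 are all illustrative assumptions, not the patent's implementation.

```python
# Hedged sketch of the Fig. 3 flow. Each result maps
# category -> (confidence, intensity_level); audio is the primary output.
FIRST_INTENSITY_THRESHOLD = 2  # assumed scale: 1=slight, 2=moderate, 3=severe

def decide(audio, text, threshold=FIRST_INTENSITY_THRESHOLD):
    a_best = max(audio, key=lambda c: audio[c][0])      # step 302
    t_best = max(text, key=lambda c: text[c][0])
    if a_best == t_best:
        return [a_best]                                 # step 303
    a_conf, t_conf = audio[a_best][0], text[t_best][0]  # step 304
    if a_conf > t_conf:
        return [a_best]                                 # step 305
    if a_conf < t_conf and t_best in audio:             # step 306
        if audio[t_best][1] > threshold:                # step 307
            return [t_best]                             # step 308
    return [a_best]                                     # step 309 (one variant)
```

Swapping the roles of the two arguments gives the mirrored Fig. 4 flow with the text result as the primary output.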
Fig. 4 is a flow diagram of determining the emotion recognition result in the speech recognition interaction method provided by another embodiment of the present invention. Unlike the embodiment of Fig. 3, the embodiment of Fig. 4 selects the text emotion recognition result as the primary emotion recognition output and the audio emotion recognition result as the auxiliary output. It should be appreciated that the flow of determining the emotion recognition result remains analogous to the logic of Fig. 3, with only the primary output changed to the text emotion recognition result. It may specifically include the following steps, with the repeated logic not described again:
Step 401: compute the confidence of each emotion category in the audio emotion recognition result and the confidence of each emotion category in the text emotion recognition result.
Step 402: judge whether the emotion category with the highest confidence in the audio emotion recognition result is identical to the emotion category with the highest confidence in the text emotion recognition result. If so, execute step 403; otherwise execute step 404.
Step 403: take the emotion category with the highest confidence in the audio emotion recognition result (equivalently, in the text emotion recognition result) as the emotion recognition result.
Step 404: compare the confidence of the highest-confidence emotion category in the text emotion recognition result with the confidence of the highest-confidence emotion category in the audio emotion recognition result.
If the confidence of the highest-confidence emotion category in the text emotion recognition result is greater than that in the audio emotion recognition result, execute step 405; if it is less, execute step 406; if the two are equal, execute step 409.
Step 405: take the emotion category with the highest confidence in the text emotion recognition result as the emotion recognition result.
Step 406: judge whether the text emotion recognition result includes the emotion category with the highest confidence in the audio emotion recognition result. If so, execute step 407; otherwise execute step 409.
Step 407: further judge whether the emotion intensity level, within the text emotion recognition result, of the audio emotion recognition result's highest-confidence emotion category is greater than the first intensity threshold. If so, execute step 408; otherwise execute step 409.
Step 408: take the emotion category with the highest confidence in the audio emotion recognition result as the emotion recognition result.
Step 409: take the emotion category with the highest confidence in the text emotion recognition result as the emotion recognition result, or take the highest-confidence emotion categories of the text and audio emotion recognition results together as the emotion recognition result.
It should be appreciated that although the embodiments of Fig. 3 and Fig. 4 give examples of determining the emotion recognition result, depending on the specific forms of the audio and text emotion recognition results, the process of jointly determining the emotion recognition result from them may also be realized in other ways and is not limited to the embodiments shown in Fig. 3 and Fig. 4; the present invention places no limitation on this.
In an embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space. In this case, the coordinate values of the two points may be combined by weighted averaging, with the resulting coordinate point taken as the emotion recognition result. For example, with the PAD three-dimensional emotion model, if the audio emotion recognition result is characterized as (p1, a1, d1) and the text emotion recognition result as (p2, a2, d2), the final emotion recognition result may be characterized as ((p1+p2)/2, (a1+1.3*a2)/2, (d1+0.8*d2)/2), where 1.3 and 0.8 are weight coefficients. A non-discrete dimensional emotion model lends itself to computing the final emotion recognition result in this quantitative way. It should be appreciated, however, that the combination is not limited to the weighted averaging above; when the audio and text emotion recognition results each correspond to a coordinate point in a multidimensional emotion space, the present invention places no limitation on the specific way the emotion recognition result is determined.
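The PAD weighted-average example above can be written out directly. The per-dimension weights on the text result (1.0, 1.3, 0.8) come from the example; treating them as a general parameter is our generalization.

```python
# Weighted average of two PAD coordinate points, dimension by dimension,
# following the document's example ((p1+p2)/2, (a1+1.3*a2)/2, (d1+0.8*d2)/2).
def combine_pad(audio_pad, text_pad, text_weights=(1.0, 1.3, 0.8)):
    """audio_pad, text_pad: (P, A, D) tuples -> combined (P, A, D) tuple."""
    return tuple(
        (a + w * t) / 2 for a, t, w in zip(audio_pad, text_pad, text_weights)
    )
```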
In an embodiment of the present invention, the flow by which the speech recognition interaction method obtains the audio emotion recognition result includes:
Step 501: extract the audio feature vector of a user speech message in the audio stream to be recognized, where the user speech message corresponds to one segment of speech in the audio stream to be recognized.
The audio feature vector includes the values of one or more audio features along one or more vector directions. In effect, all the audio features are characterized in a single multidimensional vector space: the direction and magnitude of the audio feature vector can be regarded as the sum, within that space, of the values of the individual audio features along their different vector directions, with the value of each audio feature along one vector direction forming one component of the audio feature vector. User speech messages carrying different emotions necessarily exhibit different audio features; the present invention exploits precisely this correspondence between emotions and audio features to recognize the emotion of a user speech message. Specifically, the audio features may include one or more of the following: an energy feature, a pronunciation frame number feature, a fundamental frequency feature, a formant feature, a harmonic-to-noise ratio feature, and a mel-frequency cepstral coefficient feature. In an embodiment of the present invention, the vector directions in the vector space may be: proportion, mean, maximum, median, and standard deviation.
The energy feature refers to the power spectrum characteristic of the user speech message and can be obtained by summing the power spectrum. The calculation formula may be:
E (k)=Σj=1..N P (k, j);
where E denotes the value of the energy feature, k the frame number, j the frequency-bin number, N the frame length, and P the value of the power spectrum. In an embodiment of the present invention, the energy feature may include a first-order difference of short-time energy and/or the amount of energy below a preset frequency. The calculation formula of the first-order difference of short-time energy may be:
VE (k)=(- 2*E (k-2)-E (k-1)+E (k+1)+2*E (k+2))/3;
The amount of energy below the preset frequency may be measured by a ratio; for example, the ratio of the band energy below 500 Hz to the total energy may be calculated as:
p1=Σk=k1..k2 Σj=1..j500 P (k, j)/Σk=k1..k2 Σj=1..N P (k, j);
where j500 is the frequency-bin number corresponding to 500 Hz, k1 is the number of the speech start frame of the user speech message to be recognized, and k2 is the number of its speech end frame.
The pronunciation frame number feature refers to the overall number of voiced (pronunciation) frames in the user speech message and may be measured by ratios. For example, if the numbers of voiced frames and silent frames in the user speech message are n1 and n2 respectively, then the ratio of voiced frames to silent frames is p2=n1/n2, and the ratio of voiced frames to total frames is p3=n1/(n1+n2).
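The energy features above can be sketched in a few lines. The power spectrum is represented here as a plain list of frames (each a list of bin values); this layout and the framing are assumptions for the example, not the patent's implementation.

```python
# Illustrative sketch of the energy features: per-frame energy E(k) as the
# sum of the power spectrum, its first-order difference VE(k), and the share
# of total energy in bins below the one corresponding to 500 Hz.
def frame_energy(power_spec):
    """power_spec: list of frames, each a list of N bin values -> E per frame."""
    return [sum(frame) for frame in power_spec]

def short_time_energy_delta(E, k):
    """VE(k) = (-2*E(k-2) - E(k-1) + E(k+1) + 2*E(k+2)) / 3."""
    return (-2 * E[k - 2] - E[k - 1] + E[k + 1] + 2 * E[k + 2]) / 3

def low_band_ratio(power_spec, j500):
    """Ratio of energy in bins [0, j500) to total energy over all frames."""
    low = sum(sum(frame[:j500]) for frame in power_spec)
    total = sum(sum(frame) for frame in power_spec)
    return low / total
```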
The fundamental frequency feature may be extracted using an algorithm based on the autocorrelation function of the linear prediction (LPC) error signal, and may include the fundamental frequency and/or its first-order difference. The algorithm flow for the fundamental frequency may be as follows. First, compute the linear prediction coefficients of a voiced frame x (k) and compute the linear prediction estimate signal x' (k); the error signal is e (k)=x (k)-x' (k). Secondly, compute the autocorrelation function c1 of the error signal:
c1 (h)=Σk e (k)*e (k+h);
Then, within the offset range corresponding to fundamental frequencies of 80-500 Hz, find the maximum of the autocorrelation function and record the corresponding offset Δh. The calculation formula of the fundamental frequency F0 is F0=Fs/Δh, where Fs is the sampling frequency.
The formant feature may be extracted using an algorithm based on root-finding of the linear prediction polynomial, and may include the first formant, the second formant, the third formant, and the first-order differences of all three formants. The harmonic-to-noise ratio (HNR) feature may be extracted using an algorithm based on independent component analysis (ICA). The mel-frequency cepstral coefficient (MFCC) feature may include the 1st-12th order mel-frequency cepstral coefficients, obtained by the standard MFCC calculation flow, which is not described here.
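The pitch-tracking step can be sketched as follows. For brevity the sketch runs the autocorrelation search directly on the input frame rather than on an LPC error signal, which is a simplification of the flow described above; frame length and sampling rate are illustrative.

```python
import math

def estimate_f0(frame, fs, fmin=80.0, fmax=500.0):
    """Autocorrelation pitch estimate over offsets for F0 in [fmin, fmax] Hz."""
    lo = int(fs / fmax)                       # smallest lag to consider
    hi = int(fs / fmin)                       # largest lag to consider
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, min(hi, len(frame) - 1) + 1):
        corr = sum(frame[k] * frame[k + lag] for k in range(len(frame) - lag))
        if corr > best_corr:                  # track the autocorrelation peak
            best_corr, best_lag = corr, lag
    return fs / best_lag                      # F0 = Fs / delta_h
```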
It should be appreciated that which audio features are extracted may depend on the demands of the actual scenario; the present invention places no limitation on the type, number, and vector directions of the audio features corresponding to the extracted audio feature vector. However, in an embodiment of the present invention, for an optimal emotion recognition effect, the six audio features above may all be extracted: the energy feature, the pronunciation frame number feature, the fundamental frequency feature, the formant feature, the harmonic-to-noise ratio feature, and the mel-frequency cepstral coefficient feature. For example, when all six audio features are extracted, the extracted audio feature vector may include the 173 components shown in Table 1 below; using the audio feature vector of Table 1 with a Gaussian mixture model (GMM) as the emotion feature model, the accuracy of speech emotion recognition on the CASIA Chinese emotion corpus can reach 74% to 80%.
Table 1
In an embodiment of the present invention, the audio stream to be recognized may be a customer service interaction audio stream, and a user speech message corresponds to one user input speech segment or one customer service input speech segment in the audio stream. Since a customer interaction is often in question-and-answer form, a user input speech segment may correspond to one question or answer from the user within an interaction, and a customer service input speech segment may correspond to one question or answer from the customer service staff. Because a single question or answer from the user or the customer service staff is generally considered to express an emotion completely, taking one user input speech segment or one customer service input speech segment as the unit of emotion recognition guarantees both the integrity of emotion recognition and its real-time performance during the customer service interaction.
Step 502: match the audio feature vector of the user speech message against a plurality of emotion feature models, where the plurality of emotion feature models respectively correspond to one of the plurality of emotion categories.
These emotion feature models can be established by learning, in advance, the audio feature vectors of multiple preset user speech messages that carry emotion category labels covering the plurality of emotion categories; this establishes a correspondence between emotion feature models and emotion categories, each emotion feature model corresponding to one emotion category. The pre-learning process for establishing the emotion feature models may include: first, clustering the audio feature vectors of the multiple preset user speech messages carrying the emotion category labels, to obtain a clustering result for the preset emotion categories (S61); then, according to the clustering result, training the audio feature vectors of the preset user speech messages in each cluster into one emotion feature model (S62). Based on these emotion feature models, the emotion feature model corresponding to the current user speech message, and in turn the corresponding emotion category, can be obtained through a matching process based on the audio feature vector.
In an embodiment of the present invention, these emotion feature models may be Gaussian mixture models (GMM), for example with a mixture order of 5. The emotion feature vectors of the speech samples of each emotion classification may first be clustered using the K-means algorithm, and initial values of the parameters of the Gaussian mixture model are computed from the cluster result (the number of iterations may be 50). The E-M algorithm is then used to train the Gaussian mixture model corresponding to each emotion classification (the number of iterations may be 200). When these Gaussian mixture models are used in the emotion classification matching process, the likelihood probabilities between the audio feature vector of the current user speech message and each of the plurality of emotion feature models may be computed, and the matched emotion feature model is then determined by comparing these likelihood probabilities; for example, the emotion feature model whose likelihood probability is greater than a preset threshold and is the largest is taken as the matched emotion feature model.
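The per-class GMM training and likelihood-based matching described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the toy feature vectors, class names, and the permissive placeholder threshold are assumptions; scikit-learn's `GaussianMixture` initializes with K-means and fits by the E-M algorithm, matching the text.

```python
# Sketch of GMM-based emotion matching (mixture order 5, E-M training).
# Toy data and class labels are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "audio feature vectors" for two emotion classes.
train = {
    "satisfied": rng.normal(loc=0.0, scale=1.0, size=(200, 8)),
    "calm": rng.normal(loc=3.0, scale=1.0, size=(200, 8)),
}

# One GMM per emotion classification; K-means initialization, E-M fitting.
models = {
    label: GaussianMixture(n_components=5, max_iter=200,
                           init_params="kmeans", random_state=0).fit(x)
    for label, x in train.items()
}

def match_emotion(feature_vector, threshold=-1e9):
    """Return the emotion class whose model gives the highest log-likelihood,
    provided it exceeds a preset threshold (a permissive placeholder here)."""
    scores = {label: m.score_samples(feature_vector[None, :])[0]
              for label, m in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

print(match_emotion(rng.normal(3.0, 1.0, size=8)))  # likely "calm"
```

In practice the feature vectors would be frame-level acoustic statistics (e.g. MFCC-derived), and the threshold would be calibrated on held-out labeled speech.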
It should be appreciated that, although the above description explains that the emotion feature model may be a Gaussian mixture model, the emotion feature model may in practice also be realized in other forms, such as a support vector machine (SVM) model, a K-nearest-neighbor (KNN) classification model, a hidden Markov model (HMM), or an artificial neural network (ANN) model. The present invention places no strict limitation on the specific form of the emotion feature model. It should likewise be appreciated that, as the realization of the emotion feature model varies, the form of the matching process may be adjusted accordingly; the present invention likewise places no limitation on the specific form of the matching process.
In an embodiment of the present invention, the plurality of emotion classifications may include: a satisfied classification, a calm classification, and an irritated classification, corresponding to the emotional states a customer service agent is likely to exhibit in a customer service interaction scenario. In another embodiment, the plurality of emotion classifications may include: a satisfied classification, a calm classification, an irritated classification, and an angry classification, corresponding to the emotional states a user is likely to exhibit in a customer service interaction scenario. That is, when the audio stream to be recognized is a user/customer-service interactive audio stream in a customer service interaction scenario, if the current user speech message corresponds to a customer-service input voice segment, the plurality of emotion classifications may include: a satisfied classification, a calm classification, and an irritated classification; if the current user speech message corresponds to a user input voice segment, the plurality of emotion classifications may include: a satisfied classification, a calm classification, an irritated classification, and an angry classification. Classifying the emotions of the user and the customer service agent in this way suits a call center system more concisely, reducing the amount of calculation while meeting the emotion recognition needs of the call center system. It should nevertheless be understood that the type and number of these emotion classifications may be adjusted according to the needs of the actual application scenario; the present invention likewise places no strict limitation on the type and number of emotion classifications.
Step 503: take the emotion classification corresponding to the matched emotion feature model, as given by the matching result, as the emotion classification of the user speech message.
As described above, since there is a correspondence between emotion feature models and emotion classifications, once the matched emotion feature model has been determined by the matching process of step 502, the emotion classification corresponding to the matched emotion feature model is the recognized emotion classification. For example, when these emotion feature models are Gaussian mixture models, the matching process may be realized by computing the likelihood probabilities between the audio feature vector of the current user speech message and each of the plurality of emotion feature models, and the emotion classification corresponding to the emotion feature model whose likelihood probability is greater than a preset threshold and is the largest is then taken as the emotion classification of the user speech message.
It can thus be seen that the speech emotion recognition method provided by an embodiment of the present invention realizes real-time emotion recognition of user speech messages by extracting the audio feature vector of a user speech message in the audio stream to be recognized and matching the extracted audio feature vector against pre-established emotion feature models. In an application scenario such as a call center system, the emotional states of the agent and the customer can thus be monitored in real time during a customer service call, which can significantly improve the service quality of an enterprise employing the call center system and the customer service experience of its customers.
It should also be understood that the emotion classifications recognized by the speech emotion recognition method of the embodiment of the present invention can further cooperate with specific scene requirements to realize more flexible secondary applications. In an embodiment of the present invention, the emotion classification of the currently recognized user speech message may be displayed in real time, and the specific real-time display manner may be adjusted according to the actual scene requirements. For example, different colors of a signal lamp may characterize different emotion classifications: a blue lamp represents "satisfied", a green lamp represents "calm", a yellow lamp represents "irritated", and a red lamp represents "angry". Following the change of signal lamp color, customer service staff and quality inspection personnel can be reminded in real time of the emotional state of the current call. In another embodiment, the emotion classifications of the user speech messages recognized within a preset time period may also be counted; for example, the call record number and the timestamps of the start point and end point of the audio of each user speech message are recorded together with the emotion recognition result, ultimately forming an emotion recognition database, and the number and probability of occurrence of the various emotions over a period of time are counted and made into graphs or tables, serving the enterprise as a frame of reference for judging the service quality of its customer service staff over that period. In yet another embodiment, an emotion response message corresponding to the emotion classification of the recognized user speech message may also be sent in real time, which is applicable to an unattended machine customer service scenario. For example, when it is recognized in real time that the user in the current call is in an "angry" state, comforting words corresponding to the "angry" state are automatically replied to the user, so as to calm the user's emotions and achieve the purpose of continued communication. The correspondence between emotion classifications and emotion response messages may be pre-established by a pre-learning process.
In an embodiment of the present invention, before the audio feature vector of a user speech message in the audio stream to be recognized is extracted, the user speech message first needs to be extracted from the audio stream to be recognized, so that subsequent emotion recognition can be carried out with the user speech message as its unit; this extraction process may be performed in real time.
Fig. 5 shows a schematic flowchart of obtaining a text emotion recognition result from the text content of a user speech message in the speech recognition interaction method provided by an embodiment of the present invention. As shown in Fig. 5, the process of obtaining the text emotion recognition result from the text content of the user speech message may include the following steps:
Step 1001: identify emotion vocabulary in the text content of the user speech message, and determine a first text emotion recognition result according to the identified emotion vocabulary.
The correspondence between emotion vocabulary and the first text emotion recognition result may be established by a pre-learning process. Each emotion word has a corresponding emotion classification and emotional intensity level, and the emotion classification of the entire text content of the user speech message, together with the emotional intensity level of that classification, is obtained according to a preset statistical algorithm and the correspondence. For example, when the text content of the user speech message contains emotion words such as "thanks" (corresponding to the satisfied emotion classification with a medium emotional intensity level), "you are excellent" (corresponding to the satisfied emotion classification with a high emotional intensity level), and "excellent" (corresponding to the satisfied emotion classification with a high emotional intensity level), the corresponding first text emotion recognition result may be the satisfied emotion classification, with the emotional intensity level of that classification being high.
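A minimal sketch of step 1001 follows. The lexicon entries mirror the example above, but the aggregation rule (majority emotion class, then the strongest intensity observed for it) is an illustrative assumption standing in for the unspecified "preset statistical algorithm".

```python
# Sketch of lexicon-based first text emotion recognition (step 1001).
# Lexicon contents and the aggregation rule are illustrative assumptions.
from collections import Counter

INTENSITY = {"low": 0, "medium": 1, "high": 2}

# Emotion lexicon: word -> (emotion classification, intensity level),
# pre-established by a pre-learning process in the described method.
LEXICON = {
    "thanks": ("satisfied", "medium"),
    "you are excellent": ("satisfied", "high"),
    "excellent": ("satisfied", "high"),
    "annoying": ("irritated", "high"),
}

def first_text_emotion(text: str):
    """Return (emotion classification, intensity level) for the whole text,
    or None when no emotion vocabulary is found."""
    hits = [(emo, lvl) for word, (emo, lvl) in LEXICON.items() if word in text]
    if not hits:
        return None
    # Majority emotion class, then the strongest intensity seen for it.
    top = Counter(emo for emo, _ in hits).most_common(1)[0][0]
    level = max((lvl for emo, lvl in hits if emo == top), key=INTENSITY.get)
    return top, level

print(first_text_emotion("thanks, you are excellent"))  # ('satisfied', 'high')
```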
Step 1002: input the text content of the user speech message into a text emotion recognition deep learning model, the text emotion recognition deep learning model having been established by training on text content carrying emotion classification labels and emotional intensity level labels, and take the output of the text emotion recognition deep learning model as a second text emotion recognition result.
Step 1003: determine the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
It should be appreciated that the first text emotion recognition result and the second text emotion recognition result can be characterized in several ways. In an embodiment of the present invention, discrete emotion classifications may be used to characterize the emotion recognition results; in that case the first text emotion recognition result and the second text emotion recognition result may each include one or more of a plurality of emotion classifications, where each emotion classification may include multiple emotional intensity levels. In another embodiment of the present invention, a non-discrete dimensional emotion model may instead be used, the first text emotion recognition result and the second text emotion recognition result each corresponding to a coordinate point in a multidimensional emotion space, where each dimension of the multidimensional emotion space corresponds to a psychologically defined emotional factor. The characterization by discrete emotion classifications and the characterization by a non-discrete dimensional emotion model have both been described above and are not repeated here. It should be appreciated, however, that the first text emotion recognition result and the second text emotion recognition result may also be characterized in other ways, and the present invention places no limitation on the specific characterization. It should likewise be understood that, in an embodiment of the present invention, the final text emotion recognition result may also be determined according to only one of the first text emotion recognition result and the second text emotion recognition result; this is not limited by the present invention.
Fig. 6 shows a schematic flowchart of obtaining a text emotion recognition result from the text content of a user speech message in the speech recognition interaction method provided by an embodiment of the present invention. In this embodiment, the text emotion recognition result needs to be determined jointly from the first text emotion recognition result and the second text emotion recognition result, each of which includes one or more of a plurality of emotion classifications; the method for determining the text emotion recognition result may then include the following steps:
Step 1101: if the first text emotion recognition result and the second text emotion recognition result include an identical emotion classification, take that identical emotion classification as the text emotion recognition result.
For example, when the first text emotion recognition result includes the satisfied classification and the calm classification, and the second text emotion recognition result includes only the satisfied classification, the final text emotion recognition result may be the satisfied classification.
Step 1102: if the first text emotion recognition result and the second text emotion recognition result do not include an identical emotion classification, take the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
For example, when the first text emotion recognition result includes the satisfied classification and the second text emotion recognition result includes only the calm classification, the final text emotion recognition result may be both the satisfied classification and the calm classification. In an embodiment of the present invention, when the final text emotion recognition result includes multiple emotion classifications, the text emotion recognition results and basic intent information of previous and/or subsequent user speech messages are further combined in the subsequent process to determine the corresponding emotion intent information.
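Steps 1101 and 1102 amount to intersecting the two result sets and falling back to their union when the intersection is empty. A minimal sketch, assuming set-valued results as one concrete representation of "one or more emotion classifications":

```python
# Sketch of steps 1101-1102: intersection first, union as fallback.
def combine_text_results(first: set[str], second: set[str]) -> set[str]:
    """Return the combined text emotion recognition result."""
    common = first & second
    return common if common else first | second

print(combine_text_results({"satisfied", "calm"}, {"satisfied"}))  # {'satisfied'}
print(combine_text_results({"satisfied"}, {"calm"}) == {"satisfied", "calm"})  # True
```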
It should be appreciated that, although step 1102 specifies that when the first text emotion recognition result and the second text emotion recognition result do not include an identical emotion classification the two are taken together as the text emotion recognition result, other embodiments of the present invention may adopt a more conservative interactive strategy, such as directly generating an error message or not outputting a text emotion recognition result at all, so as to avoid misleading the interactive process. The present invention places no strict limitation on the processing when the first text emotion recognition result and the second text emotion recognition result do not include an identical emotion classification.
Fig. 7 shows a schematic flowchart of determining the text emotion recognition result in the speech recognition interaction method provided by an embodiment of the present invention. In this embodiment, the text emotion recognition result also needs to be determined jointly from the first text emotion recognition result and the second text emotion recognition result, each of which includes one or more of a plurality of emotion classifications; the method for determining the text emotion recognition result may include the following steps:
Step 1201: calculate the confidence of each emotion classification in the first text emotion recognition result and the confidence of each emotion classification in the second text emotion recognition result.
Statistically, confidence is also referred to as reliability, confidence level, or confidence coefficient. Because samples are random, the conclusion obtained when estimating a population parameter from a sample is always uncertain. Interval estimation in mathematical statistics can therefore be used to estimate how large the probability is that the error between an estimate and the population parameter lies within a certain allowed range; this probability is called the confidence. For example, suppose a preset emotion classification is related to a variable characterizing that classification, i.e. different values of the variable can be mapped to different emotion classifications. To obtain the confidence of a speech text emotion recognition result, multiple measured values of the variable are first obtained through multiple passes of the first/second text emotion recognition process, and the mean of these measured values is taken as an estimate. Interval estimation is then used to estimate the probability that the error between this estimate and the true value of the variable lies within a certain range; the larger this probability, the more accurate the estimate, i.e. the higher the confidence of the current emotion classification. It should be appreciated that the variable characterizing an emotion classification may be determined according to the specific emotion recognition algorithm; this is not limited by the present invention.
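The confidence computation just described can be sketched as follows: repeated measurements of the class-score variable, their mean as the estimate, and an interval estimate of the probability that the estimation error stays within an allowed range. The allowed-error value, the toy scores, and the use of a normal approximation for the sample mean are illustrative assumptions.

```python
# Sketch of step 1201's confidence calculation via interval estimation.
# Normal approximation and toy values are illustrative assumptions.
from math import sqrt
from statistics import NormalDist, mean, stdev

def classification_confidence(measurements, allowed_error=0.1):
    """Return (estimate, confidence): the mean of the repeated measurements,
    and the probability under a normal approximation of the sample mean
    that the estimation error lies within allowed_error."""
    n = len(measurements)
    est = mean(measurements)
    se = stdev(measurements) / sqrt(n)  # standard error of the mean
    if se == 0:
        return est, 1.0
    err = NormalDist(mu=0.0, sigma=se)
    return est, err.cdf(allowed_error) - err.cdf(-allowed_error)

# Repeated recognition scores for one emotion classification (toy values).
scores = [0.82, 0.79, 0.84, 0.80, 0.83, 0.81]
estimate, confidence = classification_confidence(scores)
print(round(estimate, 3), round(confidence, 3))
```

A tighter allowed error yields a lower confidence for the same measurements, matching the intuition that stricter accuracy claims are less certain.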
Step 1202: judge whether the emotion classification with the highest confidence in the first text emotion recognition result is identical to the emotion classification with the highest confidence in the second text emotion recognition result. If the judgment is yes, execute step 1203; otherwise execute step 1204.
Step 1203: take the emotion classification with the highest confidence in the first text emotion recognition result, or the emotion classification with the highest confidence in the second text emotion recognition result, as the text emotion recognition result.
In this case the emotion classifications with the highest confidence in the first text emotion recognition result and in the second text emotion recognition result are identical, so this identical highest-confidence emotion classification can be taken directly as the final text emotion recognition result. For example, when the first text emotion recognition result includes the satisfied classification (confidence a1) and the calm classification (confidence a2), the second text emotion recognition result includes only the satisfied classification (confidence b1), and a1 > a2, the satisfied classification is taken as the final text emotion recognition result.
Step 1204: compare the confidence of the highest-confidence emotion classification in the first text emotion recognition result with the confidence of the highest-confidence emotion classification in the second text emotion recognition result.
In an embodiment of the present invention, considering the specific emotion recognition algorithm and the limitations on the type and content of user speech messages in an actual application scenario, one of the first text emotion recognition result and the second text emotion recognition result may be selected as the primarily considered text emotion recognition result output, with the other serving as an auxiliary text emotion recognition result output; factors such as confidence and emotional intensity level are then weighed together to determine the final text emotion recognition result. It should be appreciated that which of the first text emotion recognition result and the second text emotion recognition result is selected as the primarily considered output may depend on the actual scenario; the present invention places no limitation on this selection.
In an embodiment of the present invention, the first text emotion recognition result is output as the primarily considered text emotion recognition result, and the second text emotion recognition result is output as the auxiliary text emotion recognition result. In this case, if the confidence of the highest-confidence emotion classification in the first text emotion recognition result is greater than that in the second text emotion recognition result, execute step 1205; if it is less, execute step 1206; if the two confidences are equal, execute step 1209.
Step 1205: take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result.
Since the first text emotion recognition result has been selected as the primarily considered output, the emotion classifications in the first text emotion recognition result should naturally be considered first; moreover, the confidence of its highest-confidence emotion classification is greater than that of the highest-confidence emotion classification in the second text emotion recognition result, so the highest-confidence emotion classification of the primarily considered first text emotion recognition result may be selected as the text emotion recognition result. For example, when the first text emotion recognition result includes the satisfied classification (confidence a1) and the calm classification (confidence a2), the second text emotion recognition result includes only the calm classification (confidence b1), and a1 > a2 and a1 > b1, the satisfied classification is taken as the final text emotion recognition result.
Step 1206: judge whether the first text emotion recognition result includes the emotion classification with the highest confidence in the second text emotion recognition result. If the judgment is yes, execute step 1207; if no, execute step 1209.
When the confidence of the highest-confidence emotion classification in the first text emotion recognition result is less than that in the second text emotion recognition result, the highest-confidence emotion classification of the second text emotion recognition result is likely the more credible one; but since the first text emotion recognition result has been selected as the primarily considered output, it is necessary to judge whether the first text emotion recognition result includes the highest-confidence emotion classification of the second text emotion recognition result. If it does, the emotional intensity level can then be used to measure whether the highest-confidence emotion classification of the second text emotion recognition result needs to be taken into account as an auxiliary consideration. For example, when the first text emotion recognition result includes the satisfied classification (confidence a1) and the calm classification (confidence a2), the second text emotion recognition result includes only the calm classification (confidence b1), and a1 > a2 and a1 < b1, it is necessary to judge whether the first text emotion recognition result includes the calm classification, which has the highest confidence in the second text emotion recognition result.
Step 1207: further judge whether the emotional intensity level of the highest-confidence emotion classification of the second text emotion recognition result is greater than a first intensity threshold. If the result of this further judgment is yes, execute step 1208; otherwise execute step 1209.
Step 1208: take the emotion classification with the highest confidence in the second text emotion recognition result as the text emotion recognition result.
Reaching step 1208 means that the first text emotion recognition result does include the highest-confidence emotion classification of the second text emotion recognition result, and that the emotional intensity level of that classification is sufficiently high. In other words, the highest-confidence emotion classification of the second text emotion recognition result is not only highly credible but also shows a clear emotional tendency, so it may be taken as the text emotion recognition result. For example, when the first text emotion recognition result includes the satisfied classification (confidence a1) and the calm classification (confidence a2), the second text emotion recognition result includes only the calm classification (confidence b1), a1 > a2, a1 < b1, and the emotional intensity level of the calm classification in the second text emotion recognition result is greater than the first intensity threshold, the calm classification is taken as the final text emotion recognition result.
Step 1209: take the emotion classification with the highest confidence in the first text emotion recognition result as the text emotion recognition result, or take the highest-confidence emotion classifications of the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
When the confidence of the highest-confidence emotion classification in the first text emotion recognition result equals that in the second text emotion recognition result, when the first text emotion recognition result does not include the highest-confidence emotion classification of the second text emotion recognition result, or even when it does include that classification but the classification's emotional intensity level is not high enough, no single unified emotion classification can be output as the final text emotion recognition result from the first and second text emotion recognition results. In this case, in an embodiment of the present invention, considering that the first text emotion recognition result has been selected as the primarily considered output, the emotion classification with the highest confidence in the first text emotion recognition result is taken directly as the text emotion recognition result. In another embodiment of the present invention, the first text emotion recognition result and the second text emotion recognition result may also be taken together as the text emotion recognition result, and in the subsequent process the text emotion recognition results and basic intent information of previous and/or subsequent user speech messages are combined to determine the corresponding emotion intent information.
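The decision flow of steps 1202 through 1209 can be sketched as one function, with the first result as the primarily considered output. Representing each result as a dictionary mapping emotion classification to a (confidence, intensity) pair is an illustrative assumption, as is the single-class fallback in step 1209.

```python
# Sketch of the Fig. 7 decision flow (steps 1202-1209), first result primary.
def decide_text_emotion(first, second, intensity_threshold=2):
    """first/second: dict mapping emotion class -> (confidence, intensity).
    Returns the chosen emotion classification(s) as a list."""
    top1 = max(first, key=lambda c: first[c][0])
    top2 = max(second, key=lambda c: second[c][0])
    if top1 == top2:                        # step 1203: identical top class
        return [top1]
    c1, c2 = first[top1][0], second[top2][0]
    if c1 > c2:                             # step 1205: primary result wins
        return [top1]
    if c1 < c2 and top2 in first:           # steps 1206-1208: auxiliary class
        if second[top2][1] > intensity_threshold:
            return [top2]                   # credible and intense enough
    return [top1]                           # step 1209: primary-result fallback

# a1 > a2 but a1 < b1, and the calm classification is intense enough:
first = {"satisfied": (0.8, 2), "calm": (0.6, 1)}
second = {"calm": (0.9, 3)}
print(decide_text_emotion(first, second))  # ['calm']
```

The Fig. 8 variant described next is obtained by swapping the roles of `first` and `second` in the call.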
Fig. 8 shows a schematic flowchart of determining the text emotion recognition result in the speech recognition interaction method provided by another embodiment of the present invention. Unlike the embodiment shown in Fig. 7, in the embodiment of Fig. 8 the second text emotion recognition result is selected as the primarily considered text emotion recognition result output, and the first text emotion recognition result serves as the auxiliary output. It should be appreciated that the process of determining the text emotion recognition result here can follow the same flow logic as Fig. 7, merely with the second text emotion recognition result taking the place of the primarily considered output; it may specifically include the following steps, with the duplicated logic not described again:
Step 1301: calculate the confidence of each emotion classification in the first text emotion recognition result and the confidence of each emotion classification in the second text emotion recognition result.
Step 1302: judge whether the emotion classification with the highest confidence in the first text emotion recognition result is identical to that in the second text emotion recognition result. If the judgment is yes, execute step 1303; otherwise execute step 1304.
Step 1303: take the emotion classification with the highest confidence in the first text emotion recognition result, or that in the second text emotion recognition result, as the text emotion recognition result.
Step 1304: compare the confidence of the highest-confidence emotion classification in the second text emotion recognition result with that in the first text emotion recognition result.
If the confidence of the highest-confidence emotion classification in the second text emotion recognition result is greater than that in the first text emotion recognition result, execute step 1305; if it is less, execute step 1306; if the two are equal, execute step 1309.
Step 1305: regarding the highest mood classification of confidence level in the second text Emotion identification result as text Emotion identification
As a result.
Step 1306: judging in the second text Emotion identification result whether to include setting in the first text Emotion identification result
The highest mood classification of reliability.If it is judged that be it is yes, then follow the steps 1307;If it is judged that be it is no, then execute step
Rapid 1309.
Step 1307: further judging confidence in the first text Emotion identification result in the second text Emotion identification result
Whether the emotional intensity rank for spending highest mood classification is greater than the first intensity threshold.If the result further judged be it is yes,
Then follow the steps 1308;It is no to then follow the steps 1309.
Step 1308: regarding the highest mood classification of confidence level in the first text Emotion identification result as text Emotion identification
As a result.
Step 1309: regarding the highest mood classification of confidence level in the second text Emotion identification result as text Emotion identification
As a result, or will be set in the highest mood classification of confidence level in the second text Emotion identification result and the first text Emotion identification result
The highest mood classification of reliability is collectively as text Emotion identification result.
It should be appreciated that, although the embodiments of Fig. 7 and Fig. 8 give examples of determining the text emotion recognition result, the process of synthesizing the first and second text emotion recognition results into a text emotion recognition result may, depending on their specific forms, also be implemented in other ways and is not limited to the embodiments shown in Fig. 7 and Fig. 8; the present invention imposes no limitation in this respect.
In an embodiment of the present invention, the first text emotion recognition result and the second text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space. In this case, the coordinate values of the two coordinate points may be weighted and averaged, and the coordinate point obtained after the weighted-average processing is taken as the text emotion recognition result. For example, when the PAD three-dimensional emotion model is used, with the first text emotion recognition result characterized as (p1, a1, d1) and the second as (p2, a2, d2), the final text emotion recognition result may be characterized as ((p1+p2)/2, (a1+1.3*a2)/2, (d1+0.8*d2)/2), where 1.3 and 0.8 are weight coefficients. Adopting a non-discrete dimensional emotion model in this way makes it convenient to compute the final text emotion recognition result quantitatively. It should be understood, however, that the combination is not limited to the weighted-average processing described above; the present invention does not limit the specific manner of determining the text emotion recognition result when the first and second text emotion recognition results each correspond to a coordinate point in a multidimensional emotion space.
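The per-dimension weighted average in the PAD example above can be sketched in a few lines of Python; the default weights mirror the illustrative coefficients 1.3 and 0.8 from the text and would in practice be tuned per application:

```python
def fuse_pad(first, second, weights=(1.0, 1.3, 0.8)):
    """Weighted average of two PAD coordinate points, one weight per dimension.

    Computes ((p1 + w_p*p2)/2, (a1 + w_a*a2)/2, (d1 + w_d*d2)/2).
    """
    return tuple((a + w * b) / 2 for a, b, w in zip(first, second, weights))
```

For instance, `fuse_pad((0.4, 0.2, 0.0), (0.6, 0.2, 0.5))` yields approximately (0.5, 0.23, 0.2).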
Fig. 9 is a schematic flowchart of obtaining basic intent information from a user voice message in the speech recognition interaction method provided by an embodiment of the present invention. As shown in Fig. 9, the process of obtaining the basic intent information may include the following steps:
Step 1401: match the text content of the user voice message against multiple preset semantic templates in a semantic knowledge base to determine the matched semantic template, where the correspondences between semantic templates and basic intent information are established in advance in the semantic knowledge base, and the same intent information corresponds to one or more semantic templates.
It should be appreciated that matching semantics through semantic templates (e.g., standard questions and extended-question semantic templates) is one implementation; the speech text input by the user may also be matched or classified directly by extracting character, word, and sentence vector features through a network (possibly with an attention mechanism added).
Step 1402: obtain the basic intent information corresponding to the matched semantic template.
In an embodiment of the present invention, the text content of the user voice message may correspond to a "standard question" in the semantic knowledge base. A "standard question" is the text used to represent a certain knowledge point; its main goals are clear expression and ease of maintenance. Here "question" should not be interpreted narrowly as an "inquiry" but broadly as an "input" that has a corresponding "output". Ideally, when interacting with the intelligent machine, the user would use a standard question, which the machine's intelligent semantic recognition system understands immediately.
In practice, however, users often use not the standard question but some deformation of it, known as an extended question. Intelligent semantic recognition therefore requires extended questions of the standard question in the knowledge base; an extended question differs slightly in its form of expression from the standard question but expresses the same meaning. Accordingly, in a further embodiment of the present invention, a semantic template is a set of one or more semantic expressions representing a certain semantic content, generated by a developer according to predetermined rules in combination with that semantic content; that is, one semantic template can describe sentences with multiple different ways of expressing the corresponding semantic content, so as to cope with the various possible deformations of the text content of the user voice message. Matching the text content of the user voice message against preset semantic templates in this way avoids the limitation of recognizing user voice messages using only a "standard question", which can describe just one form of expression.
For example, abstract semantics may be used to further abstract the generic attributes of an ontology. The abstract semantics of a category describe the different expressions of a class of abstract semantics through a set of abstract semantic expressions; to express more abstract semantics, these abstract semantic expressions are expanded on their components.
It should be appreciated that the specific content and part of speech of the semantic component words, the specific content and part of speech of the semantic rule words, and the definition and collocation of the semantic symbols may all be preset by developers according to the specific interactive service scenario to which this speech recognition interaction method is applied; the present invention imposes no limitation on these.
In an embodiment of the present invention, the process of determining the matched semantic template from the text content of the user voice message may be implemented through a similarity calculation process. Specifically, multiple text similarities between the text content of the user voice message and the multiple preset semantic templates are calculated, and the semantic template with the highest text similarity is taken as the matched semantic template. One or more of the following similarity calculation methods may be used: edit distance, n-gram, Jaro-Winkler, and Soundex. In a further embodiment, when the semantic component words and semantic rule words in the text content of the user voice message are recognized, the semantic component words and semantic rule words contained in the user voice message and in the semantic templates may also be converted into simplified text strings to improve the efficiency of the semantic similarity calculation.
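Of the similarity measures listed above, edit distance is the simplest to illustrate. A minimal sketch (function names are illustrative): normalize the Levenshtein distance by the longer string's length to get a similarity in [0, 1], then pick the template scoring highest:

```python
def edit_distance(a, b):
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # cost of deletion, insertion, or substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def best_template(text, templates):
    """Return the preset template most similar to the message text."""
    def sim(t):
        return 1 - edit_distance(text, t) / max(len(text), len(t), 1)
    return max(templates, key=sim)
```

A real system would combine several of the listed measures rather than rely on edit distance alone.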
In an embodiment of the present invention, as mentioned above, a semantic template may be composed of semantic component words and semantic rule words, which are in turn related to the parts of speech of the words in the semantic template and the grammatical relations between those words. The similarity calculation process may therefore specifically be: first recognize the words in the text of the user voice message together with their parts of speech and grammatical relations; then identify the semantic component words and semantic rule words according to those parts of speech and grammatical relations; and then introduce the identified semantic component words and semantic rule words into a vector space model to calculate the multiple similarities between the text content of the user voice message and the multiple preset semantic templates. In an embodiment of the present invention, one or more of the following word-segmentation methods may be used to recognize the words in the text content of the user voice message, their parts of speech, and the grammatical relations between words: hidden Markov model methods, forward maximum matching, reverse maximum matching, and named entity recognition.
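The vector-space step can be illustrated with a plain bag-of-words cosine similarity over the identified semantic component words. This is a simplification for illustration: a real implementation would likely weight component words and rule words differently:

```python
from collections import Counter
import math

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists in a bag-of-words vector space."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)          # shared-term dot product
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

Each template's token list is compared against the message's token list, and the template with the highest similarity is selected, as described above.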
In an embodiment of the present invention, as mentioned above, a semantic template may be a set of multiple semantic expressions representing a certain semantic content; one semantic template can then describe sentences with multiple different ways of expressing that content, corresponding to the multiple extended questions of the same standard question. Therefore, when calculating the semantic similarity between the text content of the user voice message and the preset semantic templates, the similarities between the text content and at least one extended question expanded from each of the preset semantic templates are calculated, and the semantic template corresponding to the extended question with the highest similarity is taken as the matched semantic template. These expanded extended questions may be obtained from the semantic component words and/or semantic rule words and/or semantic symbols contained in the semantic template.
Of course, the method of obtaining the basic intent information is not limited to this: the speech text input by the user may also be matched or classified into basic intent information directly by extracting character, word, and sentence vector features through a network (possibly with an attention mechanism added).
It can be seen that the speech recognition interaction method provided by the embodiments of the present invention can realize an intelligent interaction mode that provides different answer services according to the user's emotional state, thereby greatly improving the intelligent-interaction experience. For example, when the method is applied to a physical robot in the bank customer-service field, a user says to the customer-service robot: "What should I do to report a lost credit card?". The robot receives the user voice message through a microphone, obtains the audio emotion recognition result "anxiety" by analyzing the audio data of the user voice message, and takes it as the final emotion recognition result. It converts the user voice message into text and obtains the client's basic intent information, "report a lost credit card" (this step may also involve combining past or subsequent user voice messages with the semantic knowledge base of the banking field). Then the emotion recognition result "anxiety" and the basic intent information "report a lost credit card" are linked together to obtain the emotion intent information "report a lost credit card; the user is very anxious; the credit card may be lost or stolen" (this step may likewise involve combining past or subsequent user voice messages with the banking semantic knowledge base). The corresponding interactive instruction is determined: the screen outputs the card-loss-reporting steps, while voice broadcast presents the emotion category "comfort" at a high emotional-intensity level; the output to the user meeting this emotion instruction may be a voice broadcast in a light tone at medium speed: "The steps for reporting a lost credit card are shown on the screen; please don't worry. If the card is lost or stolen, it will be frozen immediately after the report, and no loss will be caused to your property or credit...".
In an embodiment of the present invention, some application scenarios (such as bank customer service) may also consider the privacy of the interaction content and avoid the voice-broadcast operation, realizing the interactive instruction in plain text or animation instead. This modality selection of the interactive instruction can be adjusted according to the application scenario.
It should be appreciated that the presentation of the emotion category and emotional-intensity level in the interactive instruction can be realized by means such as adjusting the speech rate and intonation of the voice broadcast, which the present invention does not limit.
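As a sketch of how such a presentation mapping might look, the emotion category and intensity level can be mapped to text-to-speech prosody settings. The table entries below are invented for illustration only; they are not part of the embodiment and would be defined per application scenario:

```python
# Hypothetical (emotion category, intensity level) -> prosody settings table.
PROSODY = {
    ("comfort", "high"): {"tone": "light", "rate": "medium"},
    ("warning", "high"): {"tone": "calm", "rate": "slow"},
}

def prosody_for(category, intensity):
    """Look up broadcast prosody; fall back to a neutral default."""
    return PROSODY.get((category, intensity), {"tone": "neutral", "rate": "medium"})
```

The same lookup could drive text or animation output in scenarios where voice broadcast is avoided for privacy.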
As another example, when the speech recognition interaction method provided by the embodiments of the present invention is applied to a virtual intelligent personal assistant application of an intelligent terminal device, the user says to the terminal: "What is the fastest route from home to the airport?". The application receives the user voice message through the terminal's microphone and, by analyzing its audio data, obtains the audio emotion recognition result "excitement"; at the same time it converts the voice message into text and, by analyzing the text content, obtains the text emotion recognition result "anxiety"; by logical judgment, both emotion categories, "excitement" and "anxiety", are taken as the emotion recognition result. By combining past or subsequent user voice messages with the semantic knowledge base of this field, the client's basic intent information is obtained: "obtain for the user the fastest route navigation from home to the airport". Linking "anxiety" with this basic intent information yields the emotion intent information "obtain for the user the fastest route navigation from home to the airport; the user is very anxious and may be worried about missing a flight", while linking "excitement" with it yields "obtain for the user the fastest route navigation from home to the airport; the user is very excited and may be about to travel". Two emotion intent informations can thus be generated here; by combining past or subsequent user voice messages, it is found that the user earlier mentioned "My flight takes off at 11 o'clock; when do I need to set out?", so the user's emotion recognition result is judged to be "anxiety", and the emotion intent information is "obtain for the user the fastest route navigation from home to the airport; the user is very anxious and may be worried about missing the flight". The corresponding interactive instruction is determined: the screen outputs the navigation information, while voice broadcast presents the emotion categories "comfort" and "warning", each at a high emotional-intensity level; the output to the user meeting this emotion instruction may be a voice broadcast in a smooth tone at medium speed: "The fastest route from your home address to the airport has been planned; please navigate by the screen. Under normal driving you are expected to arrive within one hour; please don't worry. Also, remember to plan your time, drive carefully, and observe the speed limit."
As another example, when the speech recognition interaction method provided by the embodiments of the present invention is applied to an intelligent wearable device, the user says to the device during exercise: "What is the state of my heartbeat now?". The device receives the user voice message through a microphone; by analyzing the audio data of the message it obtains the audio emotion recognition result as the PAD three-dimensional emotion model vector (p1, a1, d1), and by analyzing the text content of the message it obtains the text emotion recognition result as the PAD three-dimensional emotion model vector (p2, a2, d2); combining the audio emotion recognition result and the text emotion recognition result yields the final emotion recognition result (p3, a3, d3), characterizing a combination of "worry" and "anxiety". At the same time, by combining the semantic knowledge base of the medical-health field, the device obtains the client's basic intent information, "obtain the user's heartbeat data". The emotion recognition result (p3, a3, d3) and the basic intent "obtain the user's heartbeat data" are then linked together to obtain the emotion intent information "obtain the user's heartbeat data; the user is concerned and may currently have discomfort such as a rapid heartbeat". The interactive instruction is determined according to the correspondence between emotion intent information and interactive instructions: output the heartbeat data while presenting the emotion (p6, a6, d6), i.e. a combination of "comfort" and "encouragement", each at a high emotional intensity; at the same time start a real-time heartbeat-monitoring program lasting 10 min, and voice-broadcast in a light tone at slow speed: "Your current heartbeat is 150 beats per minute; please don't worry, this is still within the normal heartbeat range. If you feel discomfort such as a rapid heartbeat, please relax and take deep breaths. Your previous health data show that your heart functions well; you can strengthen your cardiopulmonary function through regular exercise." The user's emotional state then remains under continued attention. If after 5 min the user says "Something feels wrong", and the emotion recognition process obtains the emotion recognition result as the three-dimensional emotion model vector (p7, a7, d7), characterizing "pain", the interactive instruction is updated: the screen outputs the heartbeat data, while voice broadcast presents the emotion (p8, a8, d8), i.e. "warning", at a high emotional intensity, outputting an alarm sound and voice-broadcasting in a calm tone at slow speed: "Your current heartbeat is 170 beats per minute, which exceeds the normal range; please stop exercising and adjust your breathing. If you need help, please press the screen."
Fig. 10 is a schematic structural diagram of a speech recognition interactive device provided by an embodiment of the present invention. As shown in Fig. 10, the speech recognition interactive device 10 includes: an emotion recognition module 11, a basic intent recognition module 12, and an interactive instruction determining module 13.
The emotion recognition module 11 is configured to obtain an emotion recognition result according to the user voice message, where the emotion recognition result includes at least an audio emotion recognition result, or includes at least an audio emotion recognition result and a text emotion recognition result. The basic intent recognition module 12 is configured to perform intent analysis according to the text content of the user voice message to obtain the corresponding basic intent information. The interactive instruction determining module 13 is configured to determine the corresponding interactive instruction according to the emotion recognition result and the basic intent information.
The speech recognition interactive device 10 provided by the embodiments of the present invention, on the basis of understanding the user's basic intent information, combines the emotion recognition result obtained from the user voice message and further gives an interactive instruction carrying emotion according to the basic intent information and the emotion recognition result, thereby solving the problems that intelligent interaction modes in the prior art cannot analyze the deeper intent of the user's voice message and cannot provide a more humanized interactive experience.
In an embodiment of the present invention, as shown in Fig. 11, the interactive instruction determining module 13 includes: an emotion intent recognition unit 131 and an interactive instruction determination unit 132. The emotion intent recognition unit 131 is configured to determine the corresponding emotion intent information according to the emotion recognition result and the basic intent information. The interactive instruction determination unit 132 is configured to determine the corresponding interactive instruction according to the emotion intent information, or according to the emotion intent information and the basic intent information.
In an embodiment of the present invention, the interactive instruction includes feedback content presented for the emotion intent information. For example, in some customer-service interaction scenarios, the emotion intent information analyzed from the client's voice content needs to be presented to the customer-service staff as a reminder; in that case the corresponding emotion intent information must be determined and shown through the feedback content for that emotion intent information.
In an embodiment of the present invention, the interactive instruction includes one or more of the following emotion presentation modes: a text-output emotion presentation mode, a melody-playing emotion presentation mode, a speech emotion presentation mode, an image emotion presentation mode, and a mechanical-action emotion presentation mode.
In an embodiment of the present invention, the emotion intent information includes emotional need information corresponding to the emotion recognition result; or the emotion intent information includes the emotional need information corresponding to the emotion recognition result together with the association relation between the emotion recognition result and the basic intent information.
In an embodiment of the present invention, the association relation between the emotion recognition result and the basic intent information is preset.
In an embodiment of the present invention, the user information includes at least the user voice message, and the emotion recognition module 11 is further configured to obtain the emotion recognition result according to the user voice message.
In an embodiment of the present invention, as shown in Fig. 11, the emotion recognition module 11 may include: an audio emotion recognition unit 111 configured to obtain the audio emotion recognition result according to the audio data of the user voice message; and an emotion recognition result determination unit 112 configured to determine the emotion recognition result according to the audio emotion recognition result.
Alternatively, the emotion recognition module 11 includes: the audio emotion recognition unit 111, configured to obtain the audio emotion recognition result according to the audio data of the user voice message; a text emotion recognition unit 113, configured to obtain the text emotion recognition result according to the text content of the user voice message; and the emotion recognition result determination unit 112, configured to determine the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result.
In an embodiment of the present invention, the audio emotion recognition result includes one or more of multiple emotion categories, or corresponds to a coordinate point in a multidimensional emotion space. Alternatively, the audio emotion recognition result and the text emotion recognition result each include one or more of multiple emotion categories, or each correspond to a coordinate point in a multidimensional emotion space. Each dimension of the multidimensional emotion space corresponds to a psychologically defined emotional factor; each emotion category may or may not include multiple emotional-intensity levels, which the present invention does not limit.
In an embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result each include one or more of multiple emotion categories, and the emotion recognition result determination unit 112 is further configured to: if the audio emotion recognition result and the text emotion recognition result include an identical emotion category, take that identical emotion category as the emotion recognition result.
In an embodiment of the present invention, the emotion recognition result determination unit 112 is further configured to: if the audio emotion recognition result and the text emotion recognition result include no identical emotion category, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
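The two rules just described, keep the shared categories if any exist, otherwise take both results together, reduce to a small set operation. A sketch, with sorted lists of category names standing in for the recognition results:

```python
def fuse_emotion_categories(audio_result, text_result):
    """Shared categories win; otherwise the union of both results is returned."""
    shared = set(audio_result) & set(text_result)
    pool = shared if shared else set(audio_result) | set(text_result)
    return sorted(pool)  # sorted for a deterministic output order
```

The confidence-weighted variants described below refine this rule when per-category confidences are available.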
In an embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result each include one or more of multiple emotion categories, and the emotion recognition result determination unit 112 includes: a first confidence calculation subunit 1121 and an emotion recognition subunit 1122.
The first confidence calculation subunit 1121 is configured to calculate the confidence of each emotion category in the audio emotion recognition result and in the text emotion recognition result. The emotion recognition subunit 1122 is configured to, when the highest-confidence emotion category in the audio emotion recognition result is identical to the highest-confidence emotion category in the text emotion recognition result, take the highest-confidence emotion category in the audio emotion recognition result, or equivalently in the text emotion recognition result, as the emotion recognition result.
In an embodiment of the present invention, the emotion recognition subunit 1122 is further configured to, when the highest-confidence emotion category in the audio emotion recognition result and that in the text emotion recognition result are not identical, determine the emotion recognition result according to the magnitude relation between the confidence of the highest-confidence emotion category in the audio emotion recognition result and the confidence of the highest-confidence emotion category in the text emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to that magnitude relation includes: when the confidence of the highest-confidence emotion category in the audio emotion recognition result is greater than that in the text emotion recognition result, taking the highest-confidence emotion category in the audio emotion recognition result as the emotion recognition result; and when the confidence of the highest-confidence emotion category in the audio emotion recognition result is equal to that in the text emotion recognition result, taking the highest-confidence emotion category in the audio emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of the audio and text emotion recognition results together as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to that magnitude relation further includes: when the confidence of the highest-confidence emotion category in the audio emotion recognition result is less than that in the text emotion recognition result, judging whether the audio emotion recognition result includes the highest-confidence emotion category of the text emotion recognition result; if yes, further judging whether the emotional-intensity level, in the audio emotion recognition result, of the highest-confidence emotion category of the text emotion recognition result is greater than a first intensity threshold, and if it is, taking the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result; if either judgment is no, taking the highest-confidence emotion category in the audio emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of the audio and text emotion recognition results together as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to the magnitude relation between the confidence of the highest-confidence emotion category in the text emotion recognition result and that in the audio emotion recognition result includes: when the confidence of the highest-confidence emotion category in the text emotion recognition result is greater than that in the audio emotion recognition result, taking the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result; and when the two confidences are equal, taking the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of the text and audio emotion recognition results together as the emotion recognition result.
In an embodiment of the present invention, determining the emotion recognition result according to said magnitude relation further includes: when the confidence of the highest-confidence emotion category in the text emotion recognition result is less than that in the audio emotion recognition result, judging whether the text emotion recognition result includes the highest-confidence emotion category of the audio emotion recognition result; and, if the judgment is yes, further judging whether the emotional intensity level of that category in the text emotion recognition result is greater than a second intensity threshold. If it is greater than the second intensity threshold, the highest-confidence emotion category in the audio emotion recognition result is taken as the emotion recognition result; if either judgment is no, the highest-confidence emotion category in the text emotion recognition result is taken as the emotion recognition result, or the highest-confidence emotion categories of the text emotion recognition result and the audio emotion recognition result are taken together as the emotion recognition result.
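The confidence-and-intensity decision procedure of the embodiments above can be sketched as follows. This is an illustrative sketch only: the result representation (a dict of category to confidence/intensity), the function names, and the default threshold are assumptions, not prescribed by the embodiments.

```python
def top_category(result):
    """Return (category, confidence) of the highest-confidence emotion."""
    cat = max(result, key=lambda c: result[c]["confidence"])
    return cat, result[cat]["confidence"]

def fuse_results(text_result, audio_result, second_intensity_threshold=0.5):
    """Fuse text and audio emotion recognition results.

    Each result maps an emotion category to a dict with "confidence"
    and "intensity" entries.  Returns a list of winning categories
    (normally one; two when the rule keeps both).
    """
    text_top, text_conf = top_category(text_result)
    audio_top, audio_conf = top_category(audio_result)

    if text_top == audio_top:              # both modalities agree
        return [text_top]
    if text_conf > audio_conf:             # text result dominates
        return [text_top]
    if text_conf == audio_conf:            # tie: keep both categories
        return [text_top, audio_top]
    # text_conf < audio_conf: audio's top category wins only if the
    # text result also contains it with sufficient emotional intensity
    if audio_top in text_result and \
            text_result[audio_top]["intensity"] > second_intensity_threshold:
        return [audio_top]
    return [text_top]
```

A usage example: if the text result ranks "happy" highest at 0.4 but also contains "calm" at intensity 0.9, and the audio result ranks "calm" highest at 0.7, the rule selects "calm".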
In an embodiment of the present invention, the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space. The emotion recognition result determination unit 112 is further configured to: perform weighted-average processing on the coordinate values of the coordinate points of the audio emotion recognition result and the text emotion recognition result in the multidimensional emotion space, and take the coordinate point obtained after the weighted averaging as the emotion recognition result.
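The weighted-average fusion can be sketched in a few lines. The weighting scheme is an assumption (the embodiment leaves the weights unspecified), and the three-dimensional example coordinates merely suggest a PAD-style pleasure/arousal/dominance space.

```python
def fuse_coordinates(audio_point, text_point, audio_weight=0.5):
    """Weighted average of two emotion-space coordinate points,
    dimension by dimension."""
    return [audio_weight * a + (1.0 - audio_weight) * t
            for a, t in zip(audio_point, text_point)]

# e.g. fuse_coordinates([0.8, 0.2, 0.1], [0.4, 0.6, 0.3]) -> midpoint
```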
Figure 12 is a structural schematic diagram of a speech recognition interaction apparatus provided by an embodiment of the present invention. As shown in Figure 12, the audio emotion recognition unit 111 in the speech recognition interaction apparatus 10 includes: an audio feature extraction subunit 1111, a matching subunit 1112 and an audio emotion determination subunit 1113.
The audio feature extraction subunit 1111 is configured to extract an audio feature vector of the user voice message, where the user voice message corresponds to a segment of speech in the audio stream to be recognized. The matching subunit 1112 is configured to match the audio feature vector of the user voice message against a plurality of emotion feature models, each of which corresponds to one of a plurality of emotion categories. The audio emotion determination subunit 1113 is configured to take the emotion category corresponding to the matched emotion feature model as the emotion category of the user voice message.
In an embodiment of the present invention, as shown in Figure 12, the audio emotion recognition unit 111 further comprises an emotion model establishing subunit 1114, configured to establish the plurality of emotion feature models by pre-learning on the respective audio feature vectors of a plurality of preset speech segments carrying emotion category labels of the plurality of emotion categories.
In an embodiment of the present invention, the emotion model establishing subunit 1114 includes a clustering subunit and a training subunit. The clustering subunit is configured to perform clustering on the set of audio feature vectors of the plurality of preset speech segments carrying emotion category labels, obtaining clustering results for the preset emotion categories. The training subunit is configured to train, according to the clustering results, the set of audio feature vectors of the preset speech segments in each cluster into one emotion feature model.
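The cluster-then-train procedure is model-agnostic; in practice emotion feature models are often Gaussian mixtures. The sketch below substitutes a deliberately simple stand-in, one centroid per emotion category with nearest-centroid matching, so the shape of the procedure is visible. All names and the model choice are illustrative assumptions, not the embodiment's.

```python
def train_emotion_models(labeled_vectors):
    """labeled_vectors: list of (emotion_label, feature_vector) pairs.
    Returns one 'model' (here: a centroid) per emotion category."""
    clusters = {}
    for label, vec in labeled_vectors:
        clusters.setdefault(label, []).append(vec)   # cluster by label
    models = {}
    for label, vecs in clusters.items():
        dim = len(vecs[0])
        models[label] = [sum(v[i] for v in vecs) / len(vecs)
                         for i in range(dim)]        # per-dimension mean
    return models

def match_emotion(models, vec):
    """Return the emotion category whose model best matches vec."""
    def dist(cat):  # squared Euclidean distance to a centroid
        return sum((a - b) ** 2 for a, b in zip(models[cat], vec))
    return min(models, key=dist)
```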
In an embodiment of the present invention, as shown in Figure 13, the audio emotion recognition unit 111 further comprises: a sentence endpoint detection subunit 1115 and an extraction subunit 1116. The sentence endpoint detection subunit 1115 is configured to determine a voice start frame and a voice end frame in the audio stream to be recognized. The extraction subunit 1116 is configured to extract the portion of the audio stream between the voice start frame and the voice end frame as the user voice message.
In an embodiment of the present invention, the sentence endpoint detection subunit 1115 includes: a first judgment subunit, a voice start frame determination subunit and a voice end frame determination subunit. The first judgment subunit is configured to judge whether a speech frame in the audio stream to be recognized is a voiced frame or a non-voiced frame. The voice start frame determination subunit is configured to, after the voice end frame of the preceding speech segment or when no speech segment has yet been recognized, take the first frame of a run of a first preset number of speech frames consecutively judged to be voiced frames as the voice start frame of the current speech segment. The voice end frame determination subunit is configured to, after the voice start frame of the current speech segment, take the first frame of a run of a second preset number of speech frames consecutively judged to be non-voiced frames as the voice end frame of the current speech segment.
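The run-length endpointing rule above can be sketched as follows. Frame classification (voiced vs non-voiced) is assumed to have been done already by the first judgment subunit; the function only locates start and end frames, and the parameter names and default run lengths are assumptions.

```python
def detect_endpoints(is_voiced, start_run=5, end_run=10):
    """is_voiced: sequence of booleans, one per frame.
    Returns a list of (start_frame, end_frame) index pairs."""
    segments, start, run = [], None, 0
    for i, voiced in enumerate(is_voiced):
        if start is None:                    # looking for a voice start frame
            run = run + 1 if voiced else 0
            if run == start_run:             # first frame of the voiced run
                start, run = i - start_run + 1, 0
        else:                                # looking for a voice end frame
            run = run + 1 if not voiced else 0
            if run == end_run:               # first frame of the unvoiced run
                segments.append((start, i - end_run + 1))
                start, run = None, 0
    if start is not None:                    # stream ended mid-segment
        segments.append((start, len(is_voiced) - 1))
    return segments
```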
In an embodiment of the present invention, the audio feature vector includes one or more of the following audio features: an energy feature, a voiced frame count feature, a pitch frequency feature, a formant feature, a harmonic-to-noise ratio feature and a mel-frequency cepstral coefficient (MFCC) feature.
In an embodiment of the present invention, the energy feature includes a first-order difference of short-time energy and/or the energy below a preset frequency; and/or the pitch frequency feature includes a pitch frequency and/or a first-order difference of the pitch frequency; and/or the formant feature includes one or more of: the first formant, the second formant, the third formant, and the first-order differences of the first, second and third formants; and/or the MFCC feature includes the 1st-12th order mel-frequency cepstral coefficients and/or their first-order differences.
In an embodiment of the present invention, the audio features are characterized by one or more of the following computational representations: ratio value, mean value, maximum value, median value and standard deviation.
In an embodiment of the present invention, the energy feature includes: the mean, maximum, median and standard deviation of the first-order difference of short-time energy, and/or the ratio of the energy below a preset frequency to the total energy; and/or the voiced frame count feature includes: the ratio of the number of voiced frames to the number of silent frames, and/or the ratio of the number of voiced frames to the total number of frames; the pitch frequency feature includes: the mean, maximum, median and standard deviation of the pitch frequency and/or of its first-order difference; and/or the formant feature includes one or more of: the mean, maximum, median and standard deviation of the first formant, of the second formant, of the third formant, and of their respective first-order differences; and/or the MFCC feature includes the mean, maximum, median and standard deviation of the 1st-12th order mel-frequency cepstral coefficients and/or of their first-order differences.
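The statistics listed above all follow one pattern: take a frame-level feature track (short-time energy, pitch, a formant, or one MFCC order per frame), optionally take its first-order difference, then summarize with mean, maximum, median and standard deviation. A minimal sketch, with function names as assumptions:

```python
from statistics import mean, median, pstdev

def first_order_difference(track):
    """Frame-to-frame delta of a per-frame feature track."""
    return [b - a for a, b in zip(track, track[1:])]

def track_statistics(track):
    """Mean / max / median / population std dev of a per-frame feature."""
    return {"mean": mean(track), "max": max(track),
            "median": median(track), "std": pstdev(track)}

def utterance_features(track):
    """Statistics of the raw track and of its first-order difference."""
    return {"raw": track_statistics(track),
            "delta": track_statistics(first_order_difference(track))}
```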
Figure 14 is a structural schematic diagram of a speech recognition interaction apparatus provided by an embodiment of the present invention. As shown in Figure 14, the text emotion recognition unit 113 in the speech recognition interaction apparatus 10 includes: a first text emotion recognition subunit 1131, a second text emotion recognition subunit 1132 and a text emotion determination subunit 1133.
The first text emotion recognition subunit 1131 is configured to recognize emotion vocabulary in the text content of the user voice message and to determine a first text emotion recognition result according to the recognized emotion vocabulary. The second text emotion recognition subunit 1132 is configured to input the text content of the user voice message into a text emotion recognition deep learning model, which is established by training on text content carrying emotion category labels and emotional intensity level labels, and to take the output of the model as a second text emotion recognition result. The text emotion determination subunit 1133 is configured to determine the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
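The lexicon-based first recognizer can be sketched as a scan of the text for emotion vocabulary with a per-category tally. The lexicon contents, the tokenization, and the scoring are illustrative assumptions; a production lexicon would be far larger and language-specific.

```python
EMOTION_LEXICON = {            # word -> (category, intensity level)
    "great": ("happy", 2),
    "thanks": ("happy", 1),
    "annoyed": ("angry", 2),
    "useless": ("angry", 3),
}

def lexicon_emotion(text):
    """Return {category: {"count": n, "intensity": max level}} for the text."""
    result = {}
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in EMOTION_LEXICON:
            category, level = EMOTION_LEXICON[word]
            entry = result.setdefault(category, {"count": 0, "intensity": 0})
            entry["count"] += 1
            entry["intensity"] = max(entry["intensity"], level)
    return result
```

The output shape mirrors what the determination subunit 1133 needs downstream: per-category evidence with an intensity level attached.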
In an embodiment of the present invention, the first text emotion recognition result includes one or more of a plurality of emotion categories, or corresponds to a coordinate point in a multidimensional emotion space. Alternatively, the first and second text emotion recognition results each include one or more of the plurality of emotion categories, or each correspond to a coordinate point in the multidimensional emotion space. Each dimension in the multidimensional emotion space corresponds to a psychologically defined emotional factor; each emotion category may, but need not, include a plurality of emotional intensity levels, and the present invention places no restriction on this.
In an embodiment of the present invention, the first and second text emotion recognition results each include one or more of the plurality of emotion categories. The text emotion determination subunit 1133 is further configured to: if the first and second text emotion recognition results include an identical emotion category, take that identical emotion category as the text emotion recognition result.
In an embodiment of the present invention, the text emotion determination subunit 1133 is further configured to: if the first and second text emotion recognition results include no identical emotion category, take the first and second text emotion recognition results together as the text emotion recognition result.
In an embodiment of the present invention, the first and second text emotion recognition results each include one or more of the plurality of emotion categories. The text emotion determination subunit 1133 includes: a second confidence calculation subunit 11331 and a text emotion judgment subunit 11332. The second confidence calculation subunit 11331 is configured to calculate the confidence of each emotion category in the first text emotion recognition result and in the second text emotion recognition result. The text emotion judgment subunit 11332 is configured to, when the highest-confidence emotion category in the first text emotion recognition result is identical to that in the second text emotion recognition result, take that emotion category as the text emotion recognition result.
In an embodiment of the present invention, the text emotion judgment subunit 11332 is further configured to: when the highest-confidence emotion category in the first text emotion recognition result differs from that in the second text emotion recognition result, determine the text emotion recognition result according to the magnitude relation between the confidences of the two highest-confidence emotion categories.
In an embodiment of the present invention, determining the text emotion recognition result according to the magnitude relation between the confidence of the highest-confidence emotion category in the first text emotion recognition result and that in the second text emotion recognition result includes: when the confidence of the highest-confidence emotion category in the first text emotion recognition result is greater than that in the second, taking the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result; and when the two confidences are equal, taking the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of the first and second text emotion recognition results together as the text emotion recognition result.
In an embodiment of the present invention, determining the text emotion recognition result according to said magnitude relation further includes: when the confidence of the highest-confidence emotion category in the first text emotion recognition result is less than that in the second, judging whether the first text emotion recognition result includes the highest-confidence emotion category of the second text emotion recognition result; and, if the judgment is yes, further judging whether the emotional intensity level of that category in the first text emotion recognition result is greater than a first intensity threshold. If it is greater than the first intensity threshold, the highest-confidence emotion category in the second text emotion recognition result is taken as the text emotion recognition result; if either judgment is no, the highest-confidence emotion category in the first text emotion recognition result is taken as the text emotion recognition result, or the highest-confidence emotion categories of the first and second text emotion recognition results are taken together as the text emotion recognition result.
In an embodiment of the present invention, determining the text emotion recognition result according to said magnitude relation includes: when the confidence of the highest-confidence emotion category in the second text emotion recognition result is greater than that in the first, taking the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result; and when the two confidences are equal, taking the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of the second and first text emotion recognition results together as the text emotion recognition result.
In an embodiment of the present invention, determining the text emotion recognition result according to said magnitude relation further includes: when the confidence of the highest-confidence emotion category in the second text emotion recognition result is less than that in the first, judging whether the second text emotion recognition result includes the highest-confidence emotion category of the first text emotion recognition result; and, if the judgment is yes, further judging whether the emotional intensity level of that category in the second text emotion recognition result is greater than a second intensity threshold. If it is greater than the second intensity threshold, the highest-confidence emotion category in the first text emotion recognition result is taken as the text emotion recognition result; if either judgment is no, the highest-confidence emotion category in the second text emotion recognition result is taken as the text emotion recognition result, or the highest-confidence emotion categories of the second and first text emotion recognition results are taken together as the text emotion recognition result.
In an embodiment of the present invention, the first and second text emotion recognition results each correspond to a coordinate point in the multidimensional emotion space. The text emotion determination subunit 1133 is further configured to: perform weighted-average processing on the coordinate values of the coordinate points of the first and second text emotion recognition results in the multidimensional emotion space, and take the coordinate point obtained after the weighted averaging as the text emotion recognition result.
Figure 15 is a structural schematic diagram of a speech recognition interaction apparatus provided by an embodiment of the present invention. As shown in Figure 15, the basic intention recognition module 12 in the speech recognition interaction apparatus 10 includes: a semantic template matching unit 121 and a basic intention acquisition unit 122. The semantic template matching unit 121 is configured to match the text content of the user voice message against a plurality of preset semantic templates in a semantic knowledge base to determine the matched semantic template. The basic intention acquisition unit 122 is configured to obtain the basic intent information corresponding to the matched semantic template. The correspondence between semantic templates and basic intent information is pre-established in the semantic knowledge base, and the same intent information corresponds to one or more semantic templates.
In an embodiment of the present invention, the semantic template matching unit 121 includes: a similarity calculation subunit 1211 and a semantic template determination subunit 1212. The similarity calculation subunit 1211 is configured to calculate the similarity between the text content of the user voice message and each of the plurality of preset semantic templates. The semantic template determination subunit 1212 is configured to take the semantic template with the highest similarity as the matched semantic template.
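The highest-similarity matching can be sketched as follows. The embodiment does not fix a similarity measure; the token-overlap (Jaccard) score, the example templates, and the intent labels below are all assumptions, and real systems would use richer measures such as edit distance or embeddings.

```python
SEMANTIC_TEMPLATES = {                    # template text -> basic intent
    "check my order status": "query_order",
    "i want to cancel my order": "cancel_order",
}

def jaccard(a, b):
    """Token-overlap similarity between two short texts."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def match_intent(text):
    """Return (matched template, basic intent) for the user text."""
    template = max(SEMANTIC_TEMPLATES, key=lambda t: jaccard(text, t))
    return template, SEMANTIC_TEMPLATES[template]
```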
In an embodiment of the present invention, the correspondence between emotion recognition results plus basic intent information and emotion intent information is pre-established; or the correspondence between emotion intent information and interactive instructions is pre-established; or the correspondence between emotion intent information plus basic intent information and interactive instructions is pre-established.
In an embodiment of the present invention, in order to further improve the accuracy with which the basic intent information is obtained, the basic intention recognition module 12 is further configured to: perform intention analysis according to the current user voice message in combination with past and/or subsequent user voice messages, so as to obtain the corresponding basic intent information.
In an embodiment of the present invention, in order to further improve the accuracy with which the emotion intent information is obtained, the speech recognition interaction apparatus 10 further comprises: a first recording module configured to record the emotion recognition result and the basic intent information of each user voice message. The emotion intention recognition unit 131 is further configured to: determine the corresponding emotion intent information according to the emotion recognition result and basic intent information of the current user voice message, in combination with those of past and/or subsequent user voice messages.
In an embodiment of the present invention, in order to further improve the accuracy with which the interactive instruction is obtained, the speech recognition interaction apparatus 10 further comprises: a second recording module configured to record the emotion intent information and the basic intent information of each user voice message. The interactive instruction determination unit 132 is further configured to: determine the corresponding interactive instruction according to the emotion intent information and basic intent information of the current user voice message, in combination with those of past and/or subsequent user voice messages.
It should be appreciated that each module or unit of the speech recognition interaction apparatus 10 provided in the above embodiments corresponds to one of the method steps described earlier. The operations, features and effects described for those method steps therefore apply equally to the speech recognition interaction apparatus 10 and the corresponding modules and units it contains; duplicate content is not repeated here.
An embodiment of the present invention further provides a computer device, including a memory, a processor and a computer program stored in the memory and executed by the processor, wherein the processor, when executing the computer program, implements the speech recognition interaction method of any of the foregoing embodiments.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech recognition interaction method of any of the foregoing embodiments. The computer storage medium may be any tangible medium, such as a floppy disk, CD-ROM, DVD, hard disk drive, or even a network medium.
It should be appreciated that although one form of implementation of embodiments of the present invention described above is a computer program product, the methods or apparatuses of embodiments of the present invention may be realized in software, hardware, or a combination of software and hardware. The hardware portion may be realized with dedicated logic; the software portion may be stored in a memory and executed by an appropriate instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the above methods and devices may be realized using computer-executable instructions and/or processor control code, provided for example on a carrier medium such as a disk, CD or DVD-ROM, in a programmable memory such as a read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. The methods and apparatuses of the present invention may be realized by hardware circuits such as very-large-scale integrated circuits or gate arrays, by semiconductors such as logic chips and transistors, or by programmable hardware devices such as field-programmable gate arrays and programmable logic devices; they may also be realized by software executed by various types of processors, or by a combination of the above hardware circuits and software, such as firmware.
It should be noted that although several modules or units of the apparatus are mentioned in the detailed description above, this division is merely exemplary and not mandatory. Indeed, according to exemplary embodiments of the present invention, the features and functions of two or more modules/units described above may be realized in a single module/unit, and conversely the features and functions of one module/unit described above may be further divided and realized by a plurality of modules/units. Moreover, certain modules/units described above may be omitted in certain application scenarios.
It should be appreciated that the determiners "first", "second", "third" and the like used in the description of the embodiments of the present invention serve only to state the technical solutions more clearly and cannot be used to limit the scope of the present invention.
The foregoing is merely illustrative of preferred embodiments of the present invention and is not intended to limit the invention; any modification, equivalent replacement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (41)
1. A speech recognition emotion interaction method, characterized by comprising:
obtaining an emotion recognition result according to a user voice message, wherein the emotion recognition result includes at least an audio emotion recognition result, or the emotion recognition result includes at least the audio emotion recognition result and a text emotion recognition result;
performing intention analysis according to the text content of the user voice message to obtain corresponding basic intent information; and
determining a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
2. The speech recognition interaction method according to claim 1, characterized in that obtaining the emotion recognition result according to the user voice message comprises:
obtaining the audio emotion recognition result according to the audio data of the user voice message, and determining the emotion recognition result according to the audio emotion recognition result;
or,
obtaining the audio emotion recognition result according to the audio data of the user voice message and obtaining the text emotion recognition result according to the text content of the user voice message, and determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result.
3. The speech recognition interaction method according to claim 2, characterized in that the audio emotion recognition result includes one or more of a plurality of emotion categories, or the audio emotion recognition result corresponds to a coordinate point in a multidimensional emotion space;
or, the audio emotion recognition result and the text emotion recognition result each include one or more of the plurality of emotion categories, or the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein each dimension in the multidimensional emotion space corresponds to a psychologically defined emotional factor.
4. The speech recognition interaction method according to claim 3, characterized in that the audio emotion recognition result and the text emotion recognition result each include one or more of the plurality of emotion categories;
wherein determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises:
if the audio emotion recognition result and the text emotion recognition result include an identical emotion category, taking the identical emotion category as the emotion recognition result.
5. The speech recognition interaction method according to claim 4, wherein determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result further comprises:
if the audio emotion recognition result and the text emotion recognition result include no identical emotion category, taking the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
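The agreement rule of claims 4 and 5 — keep an emotion category both recognizers produced, and otherwise keep both results — can be sketched as follows (a minimal illustration; function and variable names are my own, not from the patent):

```python
def fuse_by_agreement(audio_cats, text_cats):
    """Claims 4-5 sketch: if the audio and text emotion recognition
    results share an identical emotion category, that category is the
    emotion recognition result; otherwise both results are kept together."""
    shared = set(audio_cats) & set(text_cats)
    if shared:
        return shared                         # claim 4: identical category
    return set(audio_cats) | set(text_cats)   # claim 5: keep both results

# The shared category wins; with no overlap, both results are retained.
fuse_by_agreement({"angry", "anxious"}, {"angry"})  # {'angry'}
fuse_by_agreement({"happy"}, {"calm"})              # {'happy', 'calm'}
```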
6. The speech recognition interaction method according to claim 2, wherein the audio emotion recognition result and the text emotion recognition result each include one or more of a plurality of emotion categories;
wherein determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises:
calculating a confidence for each emotion category in the audio emotion recognition result and in the text emotion recognition result;
when the emotion category with the highest confidence in the audio emotion recognition result is identical to the emotion category with the highest confidence in the text emotion recognition result, taking that emotion category as the emotion recognition result.
7. The speech recognition interaction method according to claim 6, wherein, when the emotion category with the highest confidence in the audio emotion recognition result differs from the emotion category with the highest confidence in the text emotion recognition result, determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result further comprises:
determining the emotion recognition result according to the relative magnitudes of the confidence of the highest-confidence emotion category in the audio emotion recognition result and the confidence of the highest-confidence emotion category in the text emotion recognition result.
8. The speech recognition interaction method according to claim 7, wherein determining the emotion recognition result according to the relative magnitudes of the two confidences comprises:
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is greater than that in the text emotion recognition result, taking the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result; and
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is equal to that in the text emotion recognition result, taking the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of both results together as the emotion recognition result;
wherein determining the emotion recognition result according to the relative magnitudes of the two confidences further comprises:
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is less than that in the text emotion recognition result, judging whether the audio emotion recognition result includes the highest-confidence emotion category of the text emotion recognition result; and
if so, further judging whether the emotional intensity level, in the audio emotion recognition result, of the highest-confidence emotion category of the text emotion recognition result exceeds a first intensity threshold; if it exceeds the first intensity threshold, taking the highest-confidence emotion category of the text emotion recognition result as the emotion recognition result; if either judgment is negative, taking the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of both results together as the emotion recognition result.
9. The speech recognition interaction method according to claim 7, wherein determining the emotion recognition result according to the relative magnitudes of the two confidences comprises:
when the confidence of the highest-confidence emotion category in the text emotion recognition result is greater than that in the audio emotion recognition result, taking the highest-confidence emotion category of the text emotion recognition result as the emotion recognition result; and
when the confidence of the highest-confidence emotion category in the text emotion recognition result is equal to that in the audio emotion recognition result, taking the highest-confidence emotion category of the text emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of both results together as the emotion recognition result;
wherein determining the emotion recognition result according to the relative magnitudes of the two confidences further comprises:
when the confidence of the highest-confidence emotion category in the text emotion recognition result is less than that in the audio emotion recognition result, judging whether the text emotion recognition result includes the highest-confidence emotion category of the audio emotion recognition result; and
if so, further judging whether the emotional intensity level, in the text emotion recognition result, of the highest-confidence emotion category of the audio emotion recognition result exceeds a second intensity threshold; if it exceeds the second intensity threshold, taking the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result; if either judgment is negative, taking the highest-confidence emotion category of the text emotion recognition result as the emotion recognition result, or taking the highest-confidence emotion categories of both results together as the emotion recognition result.
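The confidence-based fusion of claims 6 through 9 follows one decision tree, applied with either the audio result (claim 8) or the text result (claim 9) as the primary source. A sketch, with names of my own choosing, where each result maps emotion category to confidence, `intensity` gives the intensity level of a category as observed in the primary result, and the equal-confidence and fallback branches pick one of the claimed alternatives:

```python
def fuse_by_confidence(primary, secondary, intensity, threshold):
    """Claims 6-9 sketch. `primary`/`secondary` map category -> confidence
    (audio is primary in claim 8, text in claim 9). `intensity` maps
    category -> emotional intensity level in the primary result."""
    p_top = max(primary, key=primary.get)
    s_top = max(secondary, key=secondary.get)
    if p_top == s_top:                       # claim 6: tops agree
        return {p_top}
    if primary[p_top] >= secondary[s_top]:   # greater, or the equal case
        return {p_top}                       # (keeping only the primary top)
    # Primary confidence is lower: does the primary result also contain
    # the secondary's top category, with enough emotional intensity?
    if s_top in primary and intensity.get(s_top, 0) > threshold:
        return {s_top}
    return {p_top, s_top}                    # fallback: keep both tops
```

Note the claims also permit keeping both top categories in the equal-confidence case; this sketch commits to one of the allowed outcomes.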
10. The speech recognition interaction method according to claim 2, wherein the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space;
wherein determining the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result comprises:
computing a weighted average of the coordinate values of the two results' coordinate points in the multidimensional emotion space, and taking the coordinate point obtained by the weighted averaging as the emotion recognition result.
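For the coordinate-point representation, claim 10 reduces to an elementwise weighted average. The claim does not fix the weights; equal weights are assumed in this sketch:

```python
def fuse_coordinates(audio_pt, text_pt, audio_w=0.5, text_w=0.5):
    """Claim 10 sketch: weighted average of the audio and text results'
    coordinate points in the multidimensional emotion space. The weights
    are a free design choice; equal weights are assumed here."""
    total = audio_w + text_w
    return tuple((a * audio_w + t * text_w) / total
                 for a, t in zip(audio_pt, text_pt))

# Two 2-D points average dimension by dimension, e.g. roughly (0.4, 0.6):
fuse_coordinates((0.2, 0.8), (0.6, 0.4))
```

Raising one weight biases the fused point toward the channel considered more reliable, e.g. toward audio when the text transcript is noisy.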
11. The speech recognition interaction method according to claim 1, wherein obtaining the text emotion recognition result according to the text content of the user voice message comprises:
identifying emotion vocabulary in the text content of the user voice message, and determining a first text emotion recognition result according to the identified emotion vocabulary;
inputting the text content of the user voice message into a text emotion recognition deep learning model, the model having been trained on text content labelled with emotion categories and emotional intensity levels, and taking the output of the model as a second text emotion recognition result; and
determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
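Claim 11 combines a vocabulary-based recognizer with a deep learning model. A minimal sketch: the lexicon and its entries are hypothetical, `model` stands in for the trained network, and the combination rule used (shared category, else both) is the one later claimed in claims 13 and 14:

```python
# Hypothetical emotion lexicon; the claim only requires that emotion
# vocabulary in the text yield a first text emotion recognition result.
EMOTION_LEXICON = {"furious": "angry", "worried": "anxious", "great": "happy"}

def lexicon_emotion(text):
    """First text emotion recognition result: match emotion vocabulary."""
    return {EMOTION_LEXICON[w] for w in text.lower().split()
            if w in EMOTION_LEXICON}

def text_emotion(text, model):
    """Claim 11 sketch. `model` stands in for a deep learning model trained
    on text labelled with emotion categories and intensity levels; its
    output is the second text emotion recognition result. Fusion follows
    the claims-13/14 rule: shared categories if any, else both results."""
    first = lexicon_emotion(text)
    second = set(model(text))
    return (first & second) or (first | second)
```

In practice the second recognizer would be an actual trained classifier; any callable returning emotion categories slots into `model` here.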
12. The speech recognition interaction method according to claim 11, wherein the first text emotion recognition result includes one or more of a plurality of emotion categories, or the first text emotion recognition result corresponds to a coordinate point in a multidimensional emotion space;
or, the first text emotion recognition result and the second text emotion recognition result each include one or more of a plurality of emotion categories, or each correspond to a coordinate point in a multidimensional emotion space;
wherein each dimension of the multidimensional emotion space corresponds to a psychologically defined emotional factor.
13. The speech recognition interaction method according to claim 11, wherein the first text emotion recognition result and the second text emotion recognition result each include one or more of the plurality of emotion categories;
wherein determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
if the first text emotion recognition result and the second text emotion recognition result include an identical emotion category, taking the identical emotion category as the text emotion recognition result.
14. The speech recognition interaction method according to claim 13, wherein determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result further comprises:
if the first text emotion recognition result and the second text emotion recognition result include no identical emotion category, taking the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
15. The speech recognition interaction method according to claim 13, wherein the first text emotion recognition result and the second text emotion recognition result each include one or more of the plurality of emotion categories;
wherein determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
calculating a confidence for each emotion category in the first text emotion recognition result and in the second text emotion recognition result;
when the emotion category with the highest confidence in the first text emotion recognition result is identical to the emotion category with the highest confidence in the second text emotion recognition result, taking that emotion category as the text emotion recognition result.
16. The speech recognition interaction method according to claim 15, wherein, when the emotion category with the highest confidence in the first text emotion recognition result differs from the emotion category with the highest confidence in the second text emotion recognition result, determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result further comprises:
determining the text emotion recognition result according to the relative magnitudes of the confidence of the highest-confidence emotion category in the first text emotion recognition result and the confidence of the highest-confidence emotion category in the second text emotion recognition result.
17. The speech recognition interaction method according to claim 16, wherein determining the text emotion recognition result according to the relative magnitudes of the two confidences comprises:
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is greater than that in the second text emotion recognition result, taking the highest-confidence emotion category of the first text emotion recognition result as the text emotion recognition result; and
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is equal to that in the second text emotion recognition result, taking the highest-confidence emotion category of the first text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of both results together as the text emotion recognition result;
wherein determining the text emotion recognition result according to the relative magnitudes of the two confidences further comprises:
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is less than that in the second text emotion recognition result, judging whether the first text emotion recognition result includes the highest-confidence emotion category of the second text emotion recognition result; and
if so, further judging whether the emotional intensity level, in the first text emotion recognition result, of the highest-confidence emotion category of the second text emotion recognition result exceeds a first intensity threshold; if it exceeds the first intensity threshold, taking the highest-confidence emotion category of the second text emotion recognition result as the text emotion recognition result; if either judgment is negative, taking the highest-confidence emotion category of the first text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of both results together as the text emotion recognition result.
18. The speech recognition interaction method according to claim 16, wherein determining the text emotion recognition result according to the relative magnitudes of the two confidences comprises:
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is greater than that in the first text emotion recognition result, taking the highest-confidence emotion category of the second text emotion recognition result as the text emotion recognition result; and
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is equal to that in the first text emotion recognition result, taking the highest-confidence emotion category of the second text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of both results together as the text emotion recognition result;
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is less than that in the first text emotion recognition result, judging whether the second text emotion recognition result includes the highest-confidence emotion category of the first text emotion recognition result; and
if so, further judging whether the emotional intensity level, in the second text emotion recognition result, of the highest-confidence emotion category of the first text emotion recognition result exceeds a second intensity threshold; if it exceeds the second intensity threshold, taking the highest-confidence emotion category of the first text emotion recognition result as the text emotion recognition result; if either judgment is negative, taking the highest-confidence emotion category of the second text emotion recognition result as the text emotion recognition result, or taking the highest-confidence emotion categories of both results together as the text emotion recognition result.
19. The speech recognition interaction method according to claim 12, wherein the first text emotion recognition result and the second text emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein determining the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result comprises:
computing a weighted average of the coordinate values of the two results' coordinate points in the multidimensional emotion space, and taking the coordinate point obtained by the weighted averaging as the text emotion recognition result.
20. A speech recognition interaction apparatus, comprising:
an emotion recognition module configured to obtain an emotion recognition result according to a user voice message, wherein the emotion recognition result includes at least an audio emotion recognition result, or includes at least an audio emotion recognition result and a text emotion recognition result;
a basic intent recognition module configured to perform intent analysis according to the text content of the user voice message to obtain corresponding basic intent information; and
an interactive instruction determining module configured to determine a corresponding interactive instruction according to the emotion recognition result and the basic intent information.
21. The speech recognition interaction apparatus according to claim 20, wherein the emotion recognition module comprises:
an audio emotion recognition unit configured to obtain an audio emotion recognition result according to audio data of the user voice message; and an emotion recognition result determining unit configured to determine the emotion recognition result according to the audio emotion recognition result;
or,
an audio emotion recognition unit configured to obtain an audio emotion recognition result according to the audio data of the user voice message; a text emotion recognition unit configured to obtain a text emotion recognition result according to the text content of the user voice message; and an emotion recognition result determining unit configured to determine the emotion recognition result according to the audio emotion recognition result and the text emotion recognition result.
22. The speech recognition interaction apparatus according to claim 21, wherein the audio emotion recognition result includes one or more of a plurality of emotion categories, or the audio emotion recognition result corresponds to a coordinate point in a multidimensional emotion space;
or, the audio emotion recognition result and the text emotion recognition result each include one or more of a plurality of emotion categories, or each correspond to a coordinate point in a multidimensional emotion space;
wherein each dimension of the multidimensional emotion space corresponds to a psychologically defined emotional factor.
23. The speech recognition interaction apparatus according to claim 21, wherein the audio emotion recognition result and the text emotion recognition result each include one or more of the plurality of emotion categories;
wherein the emotion recognition result determining unit is further configured to:
if the audio emotion recognition result and the text emotion recognition result include an identical emotion category, take the identical emotion category as the emotion recognition result.
24. The speech recognition interaction apparatus according to claim 22, wherein the emotion recognition result determining unit is further configured to:
if the audio emotion recognition result and the text emotion recognition result include no identical emotion category, take the audio emotion recognition result and the text emotion recognition result together as the emotion recognition result.
25. The speech recognition interaction apparatus according to claim 21, wherein the audio emotion recognition result and the text emotion recognition result each include one or more of the plurality of emotion categories;
wherein the emotion recognition result determining unit comprises:
a first confidence calculation subunit configured to calculate a confidence for each emotion category in the audio emotion recognition result and in the text emotion recognition result; and
an emotion recognition subunit configured to, when the emotion category with the highest confidence in the audio emotion recognition result is identical to the emotion category with the highest confidence in the text emotion recognition result, take that emotion category as the emotion recognition result.
26. The speech recognition interaction apparatus according to claim 25, wherein the emotion recognition subunit is further configured to, when the emotion category with the highest confidence in the audio emotion recognition result differs from the emotion category with the highest confidence in the text emotion recognition result, determine the emotion recognition result according to the relative magnitudes of the confidence of the highest-confidence emotion category in the audio emotion recognition result and the confidence of the highest-confidence emotion category in the text emotion recognition result.
27. The speech recognition interaction apparatus according to claim 26, wherein the emotion recognition subunit is further configured to:
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is greater than that in the text emotion recognition result, take the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result;
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is equal to that in the text emotion recognition result, take the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result, or take the highest-confidence emotion categories of both results together as the emotion recognition result;
when the confidence of the highest-confidence emotion category in the audio emotion recognition result is less than that in the text emotion recognition result, judge whether the audio emotion recognition result includes the highest-confidence emotion category of the text emotion recognition result; and
if so, further judge whether the emotional intensity level, in the audio emotion recognition result, of the highest-confidence emotion category of the text emotion recognition result exceeds a first intensity threshold; if it exceeds the first intensity threshold, take the highest-confidence emotion category of the text emotion recognition result as the emotion recognition result; if either judgment is negative, take the highest-confidence emotion category of the audio emotion recognition result as the emotion recognition result, or take the highest-confidence emotion categories of both results together as the emotion recognition result.
28. The speech recognition interactive device according to claim 26, wherein the emotion recognition subunit is further configured to:
when the confidence of the highest-confidence emotion category in the text emotion recognition result is greater than the confidence of the highest-confidence emotion category in the audio emotion recognition result, use the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result; and
when the confidence of the highest-confidence emotion category in the text emotion recognition result is equal to the confidence of the highest-confidence emotion category in the audio emotion recognition result, use the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result, or use the highest-confidence emotion category in the text emotion recognition result and the highest-confidence emotion category in the audio emotion recognition result together as the emotion recognition result;
when the confidence of the highest-confidence emotion category in the text emotion recognition result is less than the confidence of the highest-confidence emotion category in the audio emotion recognition result, judge whether the text emotion recognition result includes the highest-confidence emotion category of the audio emotion recognition result; and
if that judgment is yes, further judge whether the emotional intensity level, in the text emotion recognition result, of the highest-confidence emotion category of the audio emotion recognition result is greater than a second intensity threshold; if it is greater than the second intensity threshold, use the highest-confidence emotion category in the audio emotion recognition result as the emotion recognition result; if either judgment is no, use the highest-confidence emotion category in the text emotion recognition result as the emotion recognition result, or use the highest-confidence emotion category in the text emotion recognition result and the highest-confidence emotion category in the audio emotion recognition result together as the emotion recognition result.
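The confidence-comparison logic of claim 28 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the result structure (category → (confidence, intensity level)), the threshold value, and the `allow_both` option are assumptions, and the claim leaves it ambiguous whether the intensity check reads from the text or the audio result — the sketch assumes the text result.

```python
SECOND_INTENSITY_THRESHOLD = 2  # assumed intensity scale, e.g. 1..5

def fuse(text_result, audio_result, allow_both=False):
    """Fuse two emotion results; each maps category -> (confidence, intensity)."""
    text_top = max(text_result, key=lambda c: text_result[c][0])
    audio_top = max(audio_result, key=lambda c: audio_result[c][0])
    text_conf = text_result[text_top][0]
    audio_conf = audio_result[audio_top][0]

    if text_conf > audio_conf:
        return [text_top]                      # text result wins outright
    if text_conf == audio_conf:
        # tie: text's top alone, or both tops together
        return [text_top, audio_top] if allow_both else [text_top]
    # text_conf < audio_conf: prefer the audio category only if the text
    # result also contains it with sufficient intensity (assumed reading)
    if audio_top in text_result:
        if text_result[audio_top][1] > SECOND_INTENSITY_THRESHOLD:
            return [audio_top]
    return [text_top, audio_top] if allow_both else [text_top]
```

For example, a low-confidence text result still overrides a high-confidence audio result unless the text result corroborates the audio category strongly enough.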
29. The speech recognition interactive device according to claim 21, wherein the audio emotion recognition result and the text emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein the emotion recognition result determination unit is further configured to:
compute a weighted average of the coordinate values of the coordinate points of the audio emotion recognition result and the text emotion recognition result in the multidimensional emotion space, and use the coordinate point obtained from the weighted averaging as the emotion recognition result.
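Claim 29's weighted averaging of two points in a multidimensional emotion space is straightforward; the following sketch assumes a fixed scalar weight per modality (the patent does not specify how weights are chosen, and the dimension semantics, e.g. pleasure/arousal/dominance, are an assumption).

```python
def weighted_average_point(audio_point, text_point, audio_weight=0.5):
    """Blend two emotion-space coordinate points into one result point."""
    text_weight = 1.0 - audio_weight
    return tuple(
        audio_weight * a + text_weight * t
        for a, t in zip(audio_point, text_point)
    )
```

With equal weights this is simply the midpoint of the two recognition results in the emotion space.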
30. The speech recognition interactive device according to claim 20, wherein the text emotion recognition unit includes:
a vocabulary emotion recognition subunit, configured to identify emotion vocabulary in the text content of the user voice message, and to determine a first text emotion recognition result according to the identified emotion vocabulary;
a deep learning emotion recognition subunit, configured to input the text content of the user voice message into a text emotion recognition deep learning model, the model being trained on text content labeled with emotion classification tags and emotional intensity levels, and to use the output of the model as a second text emotion recognition result; and
a text emotion determination subunit, configured to determine the text emotion recognition result according to the first text emotion recognition result and the second text emotion recognition result.
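The two-path text pipeline of claim 30 (a lexicon pass plus a learned model, then a determination step) can be sketched as below. The toy lexicon, the stub standing in for the trained deep-learning model, and the merge rule are all assumptions for illustration; the merge follows the behavior described in claims 32/33 (a shared category wins, otherwise both results are kept).

```python
# Hypothetical emotion lexicon for the first (vocabulary) pass.
EMOTION_LEXICON = {
    "great": "happy", "love": "happy",
    "terrible": "angry", "slow": "angry",
}

def lexicon_pass(text):
    """First text emotion result: categories triggered by emotion vocabulary."""
    return {EMOTION_LEXICON[w] for w in text.lower().split() if w in EMOTION_LEXICON}

def model_pass(text):
    """Stand-in for the trained deep-learning model's output categories."""
    return {"happy"} if "love" in text.lower() else {"neutral"}

def determine(text):
    """Merge the two results per the claims 32/33 rule (assumed reading)."""
    first, second = lexicon_pass(text), model_pass(text)
    shared = first & second
    return shared if shared else first | second
```

A real system would replace `model_pass` with inference over a model trained on emotion-labeled text, as the claim describes.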
31. The speech recognition interactive device according to claim 30, wherein the first text emotion recognition result includes one or more of a plurality of emotion categories, or the first text emotion recognition result corresponds to a coordinate point in a multidimensional emotion space;
or the first text emotion recognition result and the second text emotion recognition result each include one or more of a plurality of emotion categories, or the first text emotion recognition result and the second text emotion recognition result each correspond to a coordinate point in a multidimensional emotion space;
wherein each dimension in the multidimensional emotion space corresponds to an emotional factor defined by psychology.
32. The speech recognition interactive device according to claim 30, wherein the first text emotion recognition result and the second text emotion recognition result each include one or more of the plurality of emotion categories;
wherein the text emotion determination subunit is further configured to:
if the first text emotion recognition result and the second text emotion recognition result include an identical emotion category, use that identical emotion category as the text emotion recognition result.
33. The speech recognition interactive device according to claim 32, wherein the text emotion determination subunit is further configured to:
if the first text emotion recognition result and the second text emotion recognition result do not include an identical emotion category, use the first text emotion recognition result and the second text emotion recognition result together as the text emotion recognition result.
34. The speech recognition interactive device according to claim 32, wherein the first text emotion recognition result and the second text emotion recognition result each include one or more of the plurality of emotion categories;
wherein the text emotion determination subunit includes:
a second confidence calculation subunit, configured to calculate the confidence of each emotion category in the first text emotion recognition result and the confidence of each emotion category in the second text emotion recognition result; and
a text emotion judgment subunit, configured to, when the highest-confidence emotion category in the first text emotion recognition result is identical to the highest-confidence emotion category in the second text emotion recognition result, use that highest-confidence emotion category as the text emotion recognition result.
35. The speech recognition interactive device according to claim 34, wherein the text emotion judgment subunit is further configured to:
when the highest-confidence emotion category in the first text emotion recognition result is not identical to the highest-confidence emotion category in the second text emotion recognition result, determine the emotion recognition result according to the magnitude relation between the confidence of the highest-confidence emotion category in the first text emotion recognition result and the confidence of the highest-confidence emotion category in the second text emotion recognition result.
36. The speech recognition interactive device according to claim 35, wherein the text emotion judgment subunit is further configured to:
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is greater than the confidence of the highest-confidence emotion category in the second text emotion recognition result, use the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result; and
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is equal to the confidence of the highest-confidence emotion category in the second text emotion recognition result, use the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result, or use the highest-confidence emotion category in the first text emotion recognition result and the highest-confidence emotion category in the second text emotion recognition result together as the text emotion recognition result.
37. The speech recognition interactive device according to claim 35, wherein the text emotion judgment subunit is further configured to:
when the confidence of the highest-confidence emotion category in the first text emotion recognition result is less than the confidence of the highest-confidence emotion category in the second text emotion recognition result, judge whether the first text emotion recognition result includes the highest-confidence emotion category of the second text emotion recognition result; and
if that judgment is yes, further judge whether the emotional intensity level, in the first text emotion recognition result, of the highest-confidence emotion category of the second text emotion recognition result is greater than a first intensity threshold; if it is greater than the first intensity threshold, use the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result; if either judgment is no, use the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result, or use the highest-confidence emotion category in the first text emotion recognition result and the highest-confidence emotion category in the second text emotion recognition result together as the text emotion recognition result.
38. The speech recognition interactive device according to claim 31, wherein the text emotion judgment subunit is further configured to:
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is greater than the confidence of the highest-confidence emotion category in the first text emotion recognition result, use the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result; and
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is equal to the confidence of the highest-confidence emotion category in the first text emotion recognition result, use the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result, or use the highest-confidence emotion category in the second text emotion recognition result and the highest-confidence emotion category in the first text emotion recognition result together as the text emotion recognition result;
when the confidence of the highest-confidence emotion category in the second text emotion recognition result is less than the confidence of the highest-confidence emotion category in the first text emotion recognition result, judge whether the second text emotion recognition result includes the highest-confidence emotion category of the first text emotion recognition result; and
if that judgment is yes, further judge whether the emotional intensity level, in the second text emotion recognition result, of the highest-confidence emotion category of the first text emotion recognition result is greater than a second intensity threshold; if it is greater than the second intensity threshold, use the highest-confidence emotion category in the first text emotion recognition result as the text emotion recognition result; if either judgment is no, use the highest-confidence emotion category in the second text emotion recognition result as the text emotion recognition result, or use the highest-confidence emotion category in the second text emotion recognition result and the highest-confidence emotion category in the first text emotion recognition result together as the text emotion recognition result.
39. The speech recognition interactive device according to claim 31, wherein the first text emotion recognition result and the second text emotion recognition result each correspond to a coordinate point in the multidimensional emotion space;
wherein the text emotion determination subunit is further configured to:
compute a weighted average of the coordinate values of the coordinate points of the first text emotion recognition result and the second text emotion recognition result in the multidimensional emotion space, and use the coordinate point obtained from the weighted averaging as the text emotion recognition result.
40. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 19.
41. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 19.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810079431.XA CN110085211B (en) | 2018-01-26 | 2018-01-26 | Voice recognition interaction method and device, computer equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110085211A true CN110085211A (en) | 2019-08-02 |
CN110085211B CN110085211B (en) | 2021-06-29 |
Family
ID=67412751
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810079431.XA Active CN110085211B (en) | 2018-01-26 | 2018-01-26 | Voice recognition interaction method and device, computer equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110085211B (en) |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102142253B (en) * | 2010-01-29 | 2013-05-29 | 富士通株式会社 | Voice emotion identification equipment and method |
CN103543979A (en) * | 2012-07-17 | 2014-01-29 | 联想(北京)有限公司 | Voice outputting method, voice interaction method and electronic device |
US9196248B2 (en) * | 2013-02-13 | 2015-11-24 | Bayerische Motoren Werke Aktiengesellschaft | Voice-interfaced in-vehicle assistance |
WO2016169594A1 (en) * | 2015-04-22 | 2016-10-27 | Longsand Limited | Web technology responsive to mixtures of emotions |
CN105427869A (en) * | 2015-11-02 | 2016-03-23 | 北京大学 | Session emotion autoanalysis method based on depth learning |
CN106910512A (en) * | 2015-12-18 | 2017-06-30 | 株式会社理光 | The analysis method of voice document, apparatus and system |
CN106503646B (en) * | 2016-10-19 | 2020-07-10 | 竹间智能科技(上海)有限公司 | Multi-mode emotion recognition system and method |
CN106570496B (en) * | 2016-11-22 | 2019-10-01 | 上海智臻智能网络科技股份有限公司 | Emotion identification method and apparatus and intelligent interactive method and equipment |
CN106776557B (en) * | 2016-12-13 | 2020-09-08 | 竹间智能科技(上海)有限公司 | Emotional state memory identification method and device of emotional robot |
CN107562816B (en) * | 2017-08-16 | 2021-02-09 | 苏州狗尾草智能科技有限公司 | Method and device for automatically identifying user intention |
- 2018-01-26: CN application CN201810079431.XA filed; granted as patent CN110085211B (Active)
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110910903B (en) * | 2019-12-04 | 2023-03-21 | 深圳前海微众银行股份有限公司 | Speech emotion recognition method, device, equipment and computer readable storage medium |
CN110910903A (en) * | 2019-12-04 | 2020-03-24 | 深圳前海微众银行股份有限公司 | Speech emotion recognition method, device, equipment and computer readable storage medium |
CN111106995A (en) * | 2019-12-26 | 2020-05-05 | 腾讯科技(深圳)有限公司 | Message display method, device, terminal and computer readable storage medium |
CN111833907A (en) * | 2020-01-08 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Man-machine interaction method, terminal and computer readable storage medium |
CN111273990A (en) * | 2020-01-21 | 2020-06-12 | 腾讯科技(深圳)有限公司 | Information interaction method and device, computer equipment and storage medium |
CN111401198A (en) * | 2020-03-10 | 2020-07-10 | 广东九联科技股份有限公司 | Audience emotion recognition method, device and system |
CN111401198B (en) * | 2020-03-10 | 2024-04-23 | 广东九联科技股份有限公司 | Audience emotion recognition method, device and system |
CN112151034A (en) * | 2020-10-14 | 2020-12-29 | 珠海格力电器股份有限公司 | Voice control method and device of equipment, electronic equipment and storage medium |
CN112951233A (en) * | 2021-03-30 | 2021-06-11 | 平安科技(深圳)有限公司 | Voice question and answer method and device, electronic equipment and readable storage medium |
CN113743126A (en) * | 2021-11-08 | 2021-12-03 | 北京博瑞彤芸科技股份有限公司 | Intelligent interaction method and device based on user emotion |
CN114125492B (en) * | 2022-01-24 | 2022-07-15 | 阿里巴巴(中国)有限公司 | Live content generation method and device |
CN114125492A (en) * | 2022-01-24 | 2022-03-01 | 阿里巴巴(中国)有限公司 | Live content generation method and device |
WO2023138508A1 (en) * | 2022-01-24 | 2023-07-27 | 阿里巴巴(中国)有限公司 | Livestreaming content generation method and device |
WO2023239562A1 (en) * | 2022-06-06 | 2023-12-14 | Cerence Operating Company | Emotion-aware voice assistant |
Also Published As
Publication number | Publication date |
---|---|
CN110085211B (en) | 2021-06-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11373641B2 (en) | Intelligent interactive method and apparatus, computer device and computer readable storage medium | |
CN110085221A (en) | Speech emotional exchange method, computer equipment and computer readable storage medium | |
CN110085262A (en) | Voice mood exchange method, computer equipment and computer readable storage medium | |
CN110085211A (en) | Speech recognition exchange method, device, computer equipment and storage medium | |
US11010645B2 (en) | Interactive artificial intelligence analytical system | |
CN110085220A (en) | Intelligent interaction device | |
Jing et al. | Prominence features: Effective emotional features for speech emotion recognition | |
Wu et al. | Automatic speech emotion recognition using modulation spectral features | |
Bisio et al. | Gender-driven emotion recognition through speech signals for ambient intelligence applications | |
Narayanan et al. | Behavioral signal processing: Deriving human behavioral informatics from speech and language | |
Bone et al. | Robust unsupervised arousal rating: A rule-based framework withknowledge-inspired vocal features | |
Busso et al. | Iterative feature normalization scheme for automatic emotion detection from speech | |
McKechnie et al. | Automated speech analysis tools for children’s speech production: A systematic literature review | |
US11354754B2 (en) | Generating self-support metrics based on paralinguistic information | |
Levitan et al. | Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection. | |
Al-Dujaili et al. | Speech emotion recognition: a comprehensive survey | |
Sethu et al. | Speech based emotion recognition | |
Hema et al. | Emotional speech recognition using cnn and deep learning techniques | |
CN109935241A (en) | Voice information processing method | |
Kapoor et al. | Fusing traditionally extracted features with deep learned features from the speech spectrogram for anger and stress detection using convolution neural network | |
CN113853651A (en) | Apparatus and method for speech-emotion recognition using quantized emotional states | |
Devillers et al. | Automatic detection of emotion from vocal expression | |
Wusu-Ansah | Emotion recognition from speech: An implementation in MATLAB | |
Kalatzantonakis-Jullien et al. | Investigation and ordinal modelling of vocal features for stress detection in speech | |
Chang | Speech Analysis Methodologies towards Unobtrusive Mental Health Monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: Speech recognition interaction methods, devices, computer equipment, and storage media. Granted publication date: 20210629. Pledgee: Agricultural Bank of China Limited Shanghai pilot Free Trade Zone New Area Branch. Pledgor: SHANGHAI XIAOI ROBOT TECHNOLOGY Co.,Ltd. Registration number: Y2024310000244 |