CN106503646A - Multi-modal emotion recognition system and method - Google Patents
Multi-modal emotion recognition system and method
- Publication number
- CN106503646A (application CN201610912302.5A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- identification result
- emotion identification
- voice signal
- subsystem
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V40/161: Human faces, e.g. facial parts, sketches or expressions; detection, localisation, normalisation
- G06F18/24: Pattern recognition; classification techniques
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates
- G06V40/168: Human faces; feature extraction, face representation
- G06V40/171: Local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
- G06V40/172: Human faces; classification, e.g. identification
- G06V40/174: Facial expression recognition
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/1807: Speech classification or search using natural language modelling, using prosody or stress
- G10L15/1822: Parsing for meaning understanding
- G10L15/26: Speech to text systems
Abstract
The present invention provides a multi-modal emotion recognition system and method. The system includes a speech receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion output device. The speech receiver receives a speech signal sent by a target object; the visual image receiver receives visual image data of the target object. The first emotion recognition subsystem obtains a first emotion recognition result from the speech signal; the second emotion recognition subsystem obtains a second emotion recognition result from the speech signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data. The emotion output device determines the emotional state of the target object from the first, second, and third emotion recognition results.
Description
Technical field
The present invention relates to computer processing technology, and in particular to a multi-modal emotion recognition system and method.
Background art
At present, an emotion recognition machine generally applies only one of text recognition technology, speech recognition technology, and visual image recognition technology to recognize human emotion. This single-mode approach uses little information for emotion recognition and can hardly recognize human emotion in complex situations.
Summary of the invention
The technical problem to be solved by the present invention is to provide a multi-modal emotion recognition system and method that fuse text recognition technology, speech recognition technology, and visual image recognition technology and perform human emotion recognition over multiple channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
To solve the above technical problem, the present invention provides the following technical solution:
In one aspect, the present invention provides a multi-modal emotion recognition system, including: a speech receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion output device. The speech receiver is configured to receive a speech signal sent by a target object. The visual image receiver is configured to receive visual image data of the target object. The first emotion recognition subsystem is configured to obtain a first emotion recognition result from the speech signal; the second emotion recognition subsystem is configured to obtain a second emotion recognition result from the speech signal; the third emotion recognition subsystem is configured to obtain a third emotion recognition result from the visual image data. The emotion output device is configured to determine the emotional state of the target object from the first, second, and third emotion recognition results.
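As a minimal sketch of this data flow (all names below are illustrative assumptions; the patent specifies components, not code), the two input channels feed three recognizers whose results are fused:

```python
# A minimal sketch of the system interfaces described above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EmotionResult:
    label: str         # e.g. "happy", "angry"
    confidence: float  # recognition confidence in [0, 1]

def run_pipeline(speech_signal,
                 image_frames,
                 audio_recognizer: Callable,   # first subsystem: acoustic prosody
                 text_recognizer: Callable,    # second subsystem: recognized text
                 visual_recognizer: Callable,  # third subsystem: face + body
                 fuse: Callable) -> str:
    r1 = audio_recognizer(speech_signal)
    r2 = text_recognizer(speech_signal)
    r3 = visual_recognizer(image_frames)
    return fuse(r1, r2, r3)  # emotion output device: fused emotional state
```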
Further, the first emotion recognition subsystem specifically includes an emotion saliency segmenter and a first emotion recognizer. The emotion saliency segmenter is configured to extract acoustic prosodic features from the speech signal of the speech receiver; the first emotion recognizer is configured to obtain the first emotion recognition result of the speech signal from the acoustic prosodic features. The second emotion recognition subsystem specifically includes a speech recognizer, a sentence feature extractor, and a second emotion recognizer. The speech recognizer is configured to convert the speech signal of the speech receiver into a word sequence; the sentence feature extractor is configured to extract sentence feature values from the word sequence; the second emotion recognizer is configured to obtain the second emotion recognition result of the speech signal from the sentence feature values. The third emotion recognition subsystem specifically includes a face recognition tracker, a body recognition tracker, a facial expression feature extractor, a body action feature extractor, and a third emotion recognizer. The face recognition tracker is configured to recognize and track face data in the visual image data; the body recognition tracker is configured to recognize and track whole-body data, including the head, in the visual image data. The facial expression feature extractor is configured to extract facial key points from the face data and obtain facial expression feature values from them; the body action feature extractor is configured to extract body action key points from the body data and obtain body action feature values from them. The third emotion recognizer is configured to obtain the third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values. The emotion output device is configured to determine the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-built psychology-behavior mapping graph.
Further, the first emotion recognizer obtains the first emotion recognition result of the speech signal from the acoustic prosodic features as follows: the acoustic prosodic features are substituted into a pre-built brain-like machine learning model to obtain neural-like speech features, and the neural-like speech features are substituted into a prestored emotion model to obtain a first emotion of the speech signal and a first emotion recognition confidence corresponding to the first emotion.
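This two-stage recognizer can be sketched as follows; the patent does not specify either model as code, so both are stood in for by generic objects exposing scikit-learn-style `transform` and `predict_proba` methods:

```python
# Two-stage audio emotion recognition, as a hedged sketch. `encoder` stands in
# for the pre-built brain-like model (prosodic features -> neural-like speech
# features); `emotion_model` stands in for the prestored emotion model.
import numpy as np

def recognize_audio_emotion(prosodic_features: np.ndarray, encoder, emotion_model):
    neural_like = encoder.transform(prosodic_features.reshape(1, -1))
    probs = emotion_model.predict_proba(neural_like)[0]
    best = int(np.argmax(probs))
    # Return the first emotion and its recognition confidence.
    return emotion_model.classes_[best], float(probs[best])
```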
Further, the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
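Several of the listed features can be computed with off-the-shelf audio tooling. The sketch below uses librosa as an assumed stand-in (the patent names no library) and summarizes each frame-level feature by its mean and standard deviation:

```python
import librosa
import numpy as np

def prosodic_features(wav_path: str) -> np.ndarray:
    """Extract a deliberately partial subset of the listed features."""
    y, sr = librosa.load(wav_path, sr=None)
    feats = [
        librosa.yin(y, fmin=50, fmax=500, sr=sr),        # pitch (F0)
        librosa.feature.rms(y=y),                        # RMS intensity
        librosa.feature.zero_crossing_rate(y),           # zero-crossing rate
        librosa.feature.spectral_centroid(y=y, sr=sr),   # spectral centroid
        librosa.feature.spectral_bandwidth(y=y, sr=sr),  # bandwidth
        librosa.feature.spectral_flatness(y=y),          # spectral flatness
        librosa.feature.spectral_rolloff(y=y, sr=sr),    # roll-off point
        librosa.feature.chroma_stft(y=y, sr=sr),         # chroma
    ]
    # Summarize each frame-level feature by its mean and standard deviation.
    return np.concatenate([[f.mean(), f.std()] for f in feats])
```

Features such as voice quality, formants, or PLP coefficients would need additional tooling (e.g. Praat or openSMILE); the selection above is illustrative only.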
Further, the sentence feature extractor extracts the sentence feature values from the word sequence as follows: word segmentation is performed on the word sequence to obtain word-segmentation feature values, word-category analysis is performed on the word sequence to obtain word-category feature values, and sentence-pattern syntactic analysis is performed on the word sequence to obtain sentence-pattern syntactic feature values. The second emotion recognizer obtains the second emotion recognition result of the speech signal from the sentence feature values as follows: the word-segmentation feature values, word-category feature values, and sentence-pattern syntactic feature values among the sentence feature values are input into a pre-built text emotion recognition model to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to the second emotion.
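A sketch of this text channel under stated assumptions: jieba (not named in the patent) supplies Chinese word segmentation and word-category (POS) tagging, the sentence-pattern syntactic features are reduced to a crude placeholder, and `text_emotion_model` stands in for the pre-built text emotion recognition model:

```python
import jieba.posseg as pseg

def sentence_features(text: str) -> list:
    pairs = list(pseg.cut(text))        # word segmentation with POS tags
    tokens = [p.word for p in pairs]    # word-segmentation feature values
    pos_tags = [p.flag for p in pairs]  # word-category feature values
    # Placeholder sentence-pattern features (a real system would use a parser).
    syntax = ["len=%d" % len(tokens), "q" if text.endswith(("?", "？")) else "s"]
    return tokens + pos_tags + syntax

def recognize_text_emotion(text: str, text_emotion_model):
    x = [" ".join(sentence_features(text))]         # single document string
    probs = text_emotion_model.predict_proba(x)[0]  # pre-built text model
    best = int(probs.argmax())
    # Return the second emotion and its recognition confidence.
    return text_emotion_model.classes_[best], float(probs[best])
```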
Further, the third emotion recognizer obtains the third emotion recognition result of the visual image data from the facial expression feature values and body action feature values as follows: the facial expression feature values and body action feature values are substituted into a pre-built emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
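A sketch of the visual channel, with key-point extraction abstracted away (any face and body landmark detector could supply the key points) and `visual_emotion_model` standing in for the pre-built emotion classifier:

```python
import numpy as np

def recognize_visual_emotion(face_keypoints: np.ndarray,
                             body_keypoints: np.ndarray,
                             visual_emotion_model):
    # Combine facial-expression and body-action feature values into one
    # vector, since the description classifies both together.
    x = np.concatenate([face_keypoints.ravel(), body_keypoints.ravel()])
    probs = visual_emotion_model.predict_proba(x.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    # Return the third emotion and its recognition confidence.
    return visual_emotion_model.classes_[best], float(probs[best])
```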
Further, the emotion output device determines the emotional state of the target object from the first, second, and third emotion recognition results together with the pre-built psychology-behavior mapping graph as follows. When any one of the first emotion recognition confidence of the first result, the second emotion recognition confidence of the second result, and the third emotion recognition confidence of the third result is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When the first, second, and third emotion recognition confidences are all below the set threshold, emotion tags are calculated, according to a preset weighting rule, for the first emotion of the first result, the second emotion of the second result, and the third emotion of the third result respectively, yielding a first emotion tag, a second emotion tag, and a third emotion tag; the emotional state of the target object is then determined from these three emotion tags according to the pre-built psychology-behavior mapping graph.
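This decision rule can be sketched as follows; the threshold value, the channel weights, the tag encoding, and the fallback are all illustrative assumptions, since the patent fixes only the control flow:

```python
def fuse_results(r1, r2, r3, threshold=0.8, weights=(0.3, 0.3, 0.4),
                 mapping_graph=None):
    """r1, r2, r3 are (label, confidence) pairs from the three subsystems."""
    results = [r1, r2, r3]
    # Case 1: any single recognition confidence at or above the threshold
    # decides the emotional state directly.
    for label, conf in results:
        if conf >= threshold:
            return label
    # Case 2: all confidences below the threshold. Weight each channel's
    # result into an emotion tag, then look the label triple up in the
    # psychology-behavior mapping graph.
    tags = [(label, conf * w) for (label, conf), w in zip(results, weights)]
    key = tuple(label for label, _ in tags)
    mapping_graph = mapping_graph or {}
    # Fallback when the graph has no entry: the highest weighted tag wins.
    default = max(tags, key=lambda t: t[1])[0]
    return mapping_graph.get(key, default)
```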
In another aspect, the present invention provides a multi-modal emotion recognition method, including: a speech receiver receives a speech signal sent by a target object; a visual image receiver receives visual image data of the target object; a first emotion recognition subsystem obtains a first emotion recognition result from the speech signal; a second emotion recognition subsystem obtains a second emotion recognition result from the speech signal; a third emotion recognition subsystem obtains a third emotion recognition result from the visual image data; an emotion output device determines the emotional state of the target object from the first, second, and third emotion recognition results.
Further, the first emotion recognition subsystem obtains the first emotion recognition result from the speech signal as follows: acoustic prosodic features are extracted from the speech signal of the speech receiver, and the first emotion recognition result of the speech signal is obtained from the acoustic prosodic features. The second emotion recognition subsystem obtains the second emotion recognition result from the speech signal as follows: the speech signal of the speech receiver is converted into a word sequence, sentence feature values are extracted from the word sequence, and the second emotion recognition result of the speech signal is obtained from the sentence feature values. The third emotion recognition subsystem obtains the third emotion recognition result from the visual image data as follows: face data in the visual image data is recognized and tracked; whole-body data, including the head, is recognized and tracked in the visual image data; facial key points are extracted from the face data and facial expression feature values are obtained from them; body action key points are extracted from the body data and body action feature values are obtained from them; the third emotion recognition result of the visual image data is obtained from the facial expression feature values and the body action feature values. The emotion output device determines the emotional state of the target object from the first, second, and third emotion recognition results as follows: the emotional state of the target object is determined from the first, second, and third emotion recognition results together with a pre-built psychology-behavior mapping graph.
Further, the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
The multi-modal emotion recognition system and method provided by the present invention fuse text recognition technology, speech recognition technology, and visual image recognition technology and perform human emotion recognition over multiple channels simultaneously, enabling an emotion recognition machine to accurately recognize the emotion of a target object during human-computer interaction.
Description of the drawings
Fig. 1 is a block diagram of the multi-modal emotion recognition system provided by an embodiment of the present invention;
Fig. 2 is another block diagram of the multi-modal emotion recognition system provided by an embodiment of the present invention;
Fig. 3 is a flowchart of the multi-modal emotion recognition method provided by an embodiment of the present invention;
Fig. 4 is another flowchart of the multi-modal emotion recognition method provided by an embodiment of the present invention.
Specific embodiments
The present invention is further illustrated below through specific embodiments. It should be understood, however, that these embodiments are provided only for more detailed description and are not to be construed as limiting the present invention in any form.
Embodiment one
Referring to Fig. 1, the multi-modal emotion recognition system provided by this embodiment includes: a speech receiver 1, a first emotion recognition subsystem 3, a second emotion recognition subsystem 4, a visual image receiver 2, a third emotion recognition subsystem 5, and an emotion output device 6. The speech receiver 1 receives a speech signal sent by a target object; the visual image receiver 2 receives visual image data of the target object. The first emotion recognition subsystem 3 obtains a first emotion recognition result from the speech signal; the second emotion recognition subsystem 4 obtains a second emotion recognition result from the speech signal; the third emotion recognition subsystem 5 obtains a third emotion recognition result from the visual image data. The emotion output device 6 determines the emotional state of the target object from the first, second, and third emotion recognition results.
Preferably, as shown in Fig. 2, the first emotion recognition subsystem 3 specifically includes an emotion saliency segmenter 301 and a first emotion recognizer 302. The emotion saliency segmenter 301 extracts acoustic prosodic features from the speech signal of the speech receiver 1; the first emotion recognizer 302 obtains the first emotion recognition result of the speech signal from the acoustic prosodic features. The second emotion recognition subsystem 4 specifically includes a speech recognizer 401, a sentence feature extractor 402, and a second emotion recognizer 403. The speech recognizer 401 converts the speech signal of the speech receiver 1 into a word sequence; the sentence feature extractor 402 extracts sentence feature values from the word sequence; the second emotion recognizer 403 obtains the second emotion recognition result of the speech signal from the sentence feature values. The third emotion recognition subsystem 5 specifically includes a face recognition tracker 501, a body recognition tracker 503, a facial expression feature extractor 502, a body action feature extractor 504, and a third emotion recognizer 505. The face recognition tracker 501 recognizes and tracks face data in the visual image data; the body recognition tracker 503 recognizes and tracks whole-body data, including the head, in the visual image data. The facial expression feature extractor 502 extracts facial key points from the face data and obtains facial expression feature values from them; the body action feature extractor 504 extracts body action key points from the body data and obtains body action feature values from them. The third emotion recognizer 505 obtains the third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values. The emotion output device 6 determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-built psychology-behavior mapping graph.
The multi-modal emotion recognition system provided by this embodiment of the present invention fuses text recognition technology, speech recognition technology, and visual image recognition technology and performs human emotion recognition over multiple channels simultaneously, enabling an emotion recognition machine to accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology-behavior mapping graph referred to in this embodiment is a relation base built in advance from behavioral-psychology relations; in essence, it is a graph of mapping relations from a person's outward behavior to the person's true feelings.
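As an illustration only, such a graph can be realized as a lookup from the triple of per-channel emotion labels to an inferred true emotion; the entries below are invented examples, not from the patent:

```python
# Hypothetical psychology-behavior mapping graph: (audio, text, visual)
# behavior labels -> inferred true emotion.
PSYCHOLOGY_BEHAVIOR_GRAPH = {
    ("calm", "positive", "smile"): "content",
    ("tense", "negative", "smile"): "masked anxiety",  # surface vs. true feeling
    ("excited", "positive", "open posture"): "pride",
}
```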
Further preferably, the first emotion recognizer 302 obtains the first emotion recognition result of the speech signal from the acoustic prosodic features as follows: the acoustic prosodic features are substituted into a pre-built brain-like machine learning model to obtain neural-like speech features, and the neural-like speech features are substituted into a prestored emotion model to obtain a first emotion of the speech signal and a first emotion recognition confidence corresponding to the first emotion.
Specifically, the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
Further preferably, the sentence feature extractor 402 extracts the sentence feature values from the word sequence as follows: word segmentation is performed on the word sequence to obtain word-segmentation feature values, word-category analysis is performed on the word sequence to obtain word-category feature values, and sentence-pattern syntactic analysis is performed on the word sequence to obtain sentence-pattern syntactic feature values. The second emotion recognizer 403 obtains the second emotion recognition result of the speech signal from the sentence feature values as follows: the word-segmentation feature values, word-category feature values, and sentence-pattern syntactic feature values among the sentence feature values are input into a pre-built text emotion recognition model to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to the second emotion.
Further preferably, the third emotion recognizer 505 obtains the third emotion recognition result of the visual image data from the facial expression feature values and body action feature values as follows: the facial expression feature values and body action feature values are substituted into a pre-built emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
In this embodiment, facial expression is combined with body action when performing emotion recognition on the visual image data, which improves discrimination power. For example, when a person squares the shoulders and lifts the chin while smiling, the corresponding emotion is pride; from only the squared shoulders and lifted chin, or from only the smile, pride cannot be inferred. In addition, this embodiment draws on the psychological research findings of Paul Ekman and uses a deep learning model over facial expression and body action to discriminate a person's emotion.
Further preferably, the emotion output device 6 determines the emotional state of the target object from the first, second, and third emotion recognition results together with the pre-built psychology-behavior mapping graph as follows. When any one of the first emotion recognition confidence of the first result, the second emotion recognition confidence of the second result, and the third emotion recognition confidence of the third result is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When all three confidences are below the set threshold, emotion tags are calculated, according to a preset weighting rule, for the first emotion of the first result, the second emotion of the second result, and the third emotion of the third result respectively, yielding a first emotion tag, a second emotion tag, and a third emotion tag; the emotional state of the target object is then determined from these three emotion tags according to the pre-built psychology-behavior mapping graph.
Embodiment two
Referring to Fig. 3, an embodiment of the present invention provides a multi-modal emotion recognition method, including:
Step S1: the speech receiver 1 receives a speech signal sent by a target object;
Step S2: the visual image receiver 2 receives visual image data of the target object;
Step S3: the first emotion recognition subsystem 3 obtains a first emotion recognition result from the speech signal;
Step S4: the second emotion recognition subsystem 4 obtains a second emotion recognition result from the speech signal;
Step S5: the third emotion recognition subsystem 5 obtains a third emotion recognition result from the visual image data;
Step S6: the emotion output device 6 determines the emotional state of the target object from the first, second, and third emotion recognition results.
Preferably, as shown in Fig. 4, the first emotion recognition subsystem 3 obtains the first emotion recognition result from the speech signal as follows:
Step S3.1: acoustic prosodic features are extracted from the speech signal of the speech receiver 1;
Step S3.2: the first emotion recognition result of the speech signal is obtained from the acoustic prosodic features.
The second emotion recognition subsystem 4 obtains the second emotion recognition result from the speech signal as follows:
Step S4.1: the speech signal of the speech receiver 1 is converted into a word sequence;
Step S4.2: sentence feature values are extracted from the word sequence;
Step S4.3: the second emotion recognition result of the speech signal is obtained from the sentence feature values.
The third emotion recognition subsystem 5 obtains the third emotion recognition result from the visual image data as follows:
Step S5.1: face data in the visual image data is recognized and tracked;
Step S5.2: whole-body data, including the head, is recognized and tracked in the visual image data;
Step S5.3: facial key points are extracted from the face data, and facial expression feature values are obtained from them;
Step S5.4: body action key points are extracted from the body data, and body action feature values are obtained from them;
Step S5.5: the third emotion recognition result of the visual image data is obtained from the facial expression feature values and the body action feature values.
The emotion output device 6 determines the emotional state of the target object from the first, second, and third emotion recognition results as follows:
Step S6.1: the emotional state of the target object is determined from the first, second, and third emotion recognition results together with the pre-built psychology-behavior mapping graph.
The multi-modal emotion recognition method provided by this embodiment of the present invention fuses text recognition technology, speech recognition technology, and visual image recognition technology and performs human emotion recognition over multiple channels simultaneously, enabling an emotion recognition machine to accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology-behavior mapping graph referred to in this embodiment is a relation base built in advance from behavioral-psychology relations; in essence, it is a graph of mapping relations from a person's outward behavior to the person's true feelings.
Specifically, the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
Although the present invention has been described to a certain degree, it is apparent that appropriate changes in individual conditions may be made without departing from the spirit and scope of the present invention. It is to be understood that the invention is not limited to the described embodiments, but is defined by the scope of the claims, including equivalents of each element.
Claims (10)
1. A multi-modal emotion recognition system, characterized in that it includes: a speech receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion output device;
the speech receiver is configured to receive a speech signal sent by a target object;
the visual image receiver is configured to receive visual image data of the target object;
the first emotion recognition subsystem is configured to obtain a first emotion recognition result from the speech signal;
the second emotion recognition subsystem is configured to obtain a second emotion recognition result from the speech signal;
the third emotion recognition subsystem is configured to obtain a third emotion recognition result from the visual image data;
the emotion output device is configured to determine the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result.
2. The multi-modal emotion recognition system according to claim 1, characterized in that
the first emotion recognition subsystem specifically includes an emotion saliency segmenter and a first emotion recognizer;
the emotion saliency segmenter is configured to extract acoustic prosodic features from the speech signal of the speech receiver;
the first emotion recognizer is configured to obtain the first emotion recognition result of the speech signal from the acoustic prosodic features;
the second emotion recognition subsystem specifically includes a speech recognizer, a sentence feature extractor, and a second emotion recognizer;
the speech recognizer is configured to convert the speech signal of the speech receiver into a word sequence;
the sentence feature extractor is configured to extract sentence feature values from the word sequence;
the second emotion recognizer is configured to obtain the second emotion recognition result of the speech signal from the sentence feature values;
the third emotion recognition subsystem specifically includes a face recognition tracker, a body recognition tracker, a facial expression feature extractor, a body action feature extractor, and a third emotion recognizer;
the face recognition tracker is configured to recognize and track face data in the visual image data;
the body recognition tracker is configured to recognize and track whole-body data, including the head, in the visual image data;
the facial expression feature extractor is configured to extract facial key points from the face data and obtain facial expression feature values from the facial key points;
the body action feature extractor is configured to extract body action key points from the body data and obtain body action feature values from the body action key points;
the third emotion recognizer is configured to obtain the third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values;
the emotion output device is configured to determine the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result, together with a pre-built psychology-behavior mapping graph.
3. The multi-modal emotion recognition system according to claim 2, characterized in that the first emotion recognizer obtains the first emotion recognition result of the speech signal from the acoustic prosodic features by specifically:
substituting the acoustic prosodic features into a pre-built brain-like machine learning model to obtain neural-like speech features, and substituting the neural-like speech features into a prestored emotion model to obtain a first emotion of the speech signal and a first emotion recognition confidence corresponding to the first emotion.
4. The multi-modal emotion recognition system according to claim 3, characterized in that the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
5. The multi-modal emotion recognition system according to claim 2, characterized in that
the sentence feature extractor extracts the sentence feature values from the word sequence by specifically:
performing word segmentation on the word sequence to obtain word-segmentation feature values, performing word-category analysis on the word sequence to obtain word-category feature values, and performing sentence-pattern syntactic analysis on the word sequence to obtain sentence-pattern syntactic feature values;
the second emotion recognizer obtains the second emotion recognition result of the speech signal from the sentence feature values by specifically:
inputting the word-segmentation feature values, the word-category feature values, and the sentence-pattern syntactic feature values among the sentence feature values into a pre-built text emotion recognition model, to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to the second emotion.
6. The multi-modal emotion recognition system according to claim 2, characterized in that the third emotion recognizer obtains the third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values by specifically:
substituting the facial expression feature values and the body action feature values into a pre-built emotion classifier, to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
7. The multi-modal emotion recognition system according to any one of claims 1 to 6, characterized in that the emotion output device determines the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result, together with the pre-built psychology-behavior mapping graph, by specifically:
when any one of the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result, and the third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold, taking the emotion corresponding to that emotion recognition confidence as the emotional state of the target object;
when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result, and the third emotion recognition confidence of the third emotion recognition result are all below the set threshold, calculating emotion tags, according to a preset weighting rule, for the first emotion of the first emotion recognition result, the second emotion of the second emotion recognition result, and the third emotion of the third emotion recognition result respectively, to obtain a first emotion tag, a second emotion tag, and a third emotion tag;
determining the emotional state of the target object from the first emotion tag, the second emotion tag, and the third emotion tag, according to the pre-built psychology-behavior mapping graph.
8. A multi-modal emotion recognition method, characterized in that it includes:
a speech receiver receives a speech signal sent by a target object;
a visual image receiver receives visual image data of the target object;
a first emotion recognition subsystem obtains a first emotion recognition result from the speech signal;
a second emotion recognition subsystem obtains a second emotion recognition result from the speech signal;
a third emotion recognition subsystem obtains a third emotion recognition result from the visual image data;
an emotion output device determines the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result.
9. The multi-modal emotion recognition method according to claim 8, characterized in that
the first emotion recognition subsystem obtains the first emotion recognition result from the speech signal by specifically:
extracting acoustic prosodic features from the speech signal of the speech receiver;
obtaining the first emotion recognition result of the speech signal from the acoustic prosodic features;
the second emotion recognition subsystem obtains the second emotion recognition result from the speech signal by specifically:
converting the speech signal of the speech receiver into a word sequence;
extracting sentence feature values from the word sequence;
obtaining the second emotion recognition result of the speech signal from the sentence feature values;
the third emotion recognition subsystem obtains the third emotion recognition result from the visual image data by specifically:
recognizing and tracking face data in the visual image data;
recognizing and tracking whole-body data, including the head, in the visual image data;
extracting facial key points from the face data, and obtaining facial expression feature values from the facial key points;
extracting body action key points from the body data, and obtaining body action feature values from the body action key points;
obtaining the third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values;
the emotion output device determines the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result by specifically:
determining the emotional state of the target object from the first emotion recognition result, the second emotion recognition result, and the third emotion recognition result, together with a pre-built psychology-behavior mapping graph.
10. The multi-modal emotion recognition method according to claim 8 or 9, characterized in that the acoustic prosodic features include: pitch, intensity, voice quality, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral entropy, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, spectral tilt, harmonics, voicing probability, formants, voice onset points, and spectral envelope.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610912302.5A CN106503646B (en) | 2016-10-19 | 2016-10-19 | Multi-mode emotion recognition system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106503646A (en) | 2017-03-15 |
CN106503646B CN106503646B (en) | 2020-07-10 |
Family
ID=58294258
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610912302.5A | Multi-mode emotion recognition system and method (granted as CN106503646B, active) | 2016-10-19 | 2016-10-19 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106503646B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107092895A (en) * | 2017-05-09 | 2017-08-25 | 重庆邮电大学 | A kind of multi-modal emotion identification method based on depth belief network |
CN107180236A (en) * | 2017-06-02 | 2017-09-19 | 北京工业大学 | A kind of multi-modal emotion identification method based on class brain model |
CN107194151A (en) * | 2017-04-20 | 2017-09-22 | 华为技术有限公司 | Determine the method and artificial intelligence equipment of emotion threshold value |
CN107943299A (en) * | 2017-12-07 | 2018-04-20 | 上海智臻智能网络科技股份有限公司 | Emotion rendering method and device, computer equipment and computer-readable recording medium |
CN108091323A (en) * | 2017-12-19 | 2018-05-29 | 想象科技(北京)有限公司 | For identifying the method and apparatus of emotion from voice |
CN108108849A (en) * | 2017-12-31 | 2018-06-01 | 厦门大学 | A kind of microblog emotional Forecasting Methodology based on Weakly supervised multi-modal deep learning |
CN108664932A (en) * | 2017-05-12 | 2018-10-16 | 华中师范大学 | A kind of Latent abilities state identification method based on Multi-source Information Fusion |
CN108899050A (en) * | 2018-06-14 | 2018-11-27 | 南京云思创智信息科技有限公司 | Speech signal analysis subsystem based on multi-modal Emotion identification system |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | 河南大学 | The target identification method based on class brain across media intelligent towards unmanned autonomous system |
CN109254669A (en) * | 2017-07-12 | 2019-01-22 | 腾讯科技(深圳)有限公司 | A kind of expression picture input method, device, electronic equipment and system |
CN109829363A (en) * | 2018-12-18 | 2019-05-31 | 深圳壹账通智能科技有限公司 | Expression recognition method, device, computer equipment and storage medium |
CN109903392A (en) * | 2017-12-11 | 2019-06-18 | 北京京东尚科信息技术有限公司 | Augmented reality method and apparatus |
CN110033029A (en) * | 2019-03-22 | 2019-07-19 | 五邑大学 | A kind of emotion identification method and device based on multi-modal emotion model |
CN110287912A (en) * | 2019-06-28 | 2019-09-27 | 广东工业大学 | Method, apparatus and medium are determined based on the target object affective state of deep learning |
CN110688499A (en) * | 2019-08-13 | 2020-01-14 | 深圳壹账通智能科技有限公司 | Data processing method, data processing device, computer equipment and storage medium |
CN110910903A (en) * | 2019-12-04 | 2020-03-24 | 深圳前海微众银行股份有限公司 | Speech emotion recognition method, device, equipment and computer readable storage medium |
EP3627304A1 (en) * | 2018-09-20 | 2020-03-25 | XRSpace CO., LTD. | Interactive responding method and computer system using the same |
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
CN110085211B (en) * | 2018-01-26 | 2021-06-29 | 上海智臻智能网络科技股份有限公司 | Voice recognition interaction method and device, computer equipment and storage medium |
CN113128284A (en) * | 2019-12-31 | 2021-07-16 | 上海汽车集团股份有限公司 | Multi-mode emotion recognition method and device |
US11455472B2 (en) | 2017-12-07 | 2022-09-27 | Shanghai Xiaoi Robot Technology Co., Ltd. | Method, device and computer readable storage medium for presenting emotion |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007092795A2 (en) * | 2006-02-02 | 2007-08-16 | Neuric Technologies, Llc | Method for movie animation |
US20120308971A1 (en) * | 2011-05-31 | 2012-12-06 | Hyun Soon Shin | Emotion recognition-based bodyguard system, emotion recognition device, image and sensor control apparatus, personal protection management apparatus, and control methods thereof |
CN102298694A (en) * | 2011-06-21 | 2011-12-28 | 广东爱科数字科技有限公司 | Man-machine interaction identification system applied to remote information service |
CN102881284A (en) * | 2012-09-03 | 2013-01-16 | 江苏大学 | Unspecific human voice and emotion recognition method and system |
US9031293B2 (en) * | 2012-10-19 | 2015-05-12 | Sony Computer Entertainment Inc. | Multi-modal sensor based emotion recognition and emotional interface |
CN103123619A (en) * | 2012-12-04 | 2013-05-29 | 江苏大学 | Visual speech multi-mode collaborative analysis method based on emotion context and system |
CN103456314A (en) * | 2013-09-03 | 2013-12-18 | 广州创维平面显示科技有限公司 | Emotion recognition method and device |
CN105334743A (en) * | 2015-11-18 | 2016-02-17 | 深圳创维-Rgb电子有限公司 | Intelligent home control method and system based on emotion recognition |
CN105739688A (en) * | 2016-01-21 | 2016-07-06 | 北京光年无限科技有限公司 | Man-machine interaction method and device based on emotion system, and man-machine interaction system |
CN105975594A (en) * | 2016-05-09 | 2016-09-28 | 清华大学 | Sentiment classification method and device based on combined feature vector and SVM[perf] (Support Vector Machine) |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN105869657A (en) * | 2016-06-03 | 2016-08-17 | 竹间智能科技(上海)有限公司 | System and method for identifying voice emotion |
Non-Patent Citations (3)
Title |
---|
TANJA BÄNZIGER et al.: "Emotion Recognition From Expressions in Face, Voice, and Body: The Multimodal Emotion Recognition Test (MERT)", EMOTION *
WU Qidi: "Introduction to Natural Computation", Shanghai Scientific and Technical Publishers, 31 January 2011 *
WANG Bei et al.: "Research on multimodal emotion recognition based on facial expression and speech", Informatization Research *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107194151A (en) * | 2017-04-20 | 2017-09-22 | Huawei Technologies Co., Ltd. | Method for determining emotion threshold and artificial intelligence device |
WO2018192567A1 (en) * | 2017-04-20 | 2018-10-25 | Huawei Technologies Co., Ltd. | Method for determining emotional threshold and artificial intelligence device |
CN107092895A (en) * | 2017-05-09 | 2017-08-25 | Chongqing University of Posts and Telecommunications | Multi-modal emotion recognition method based on deep belief network |
CN108664932B (en) * | 2017-05-12 | 2021-07-09 | Central China Normal University | Learning emotional state identification method based on multi-source information fusion |
CN108664932A (en) * | 2017-05-12 | 2018-10-16 | Central China Normal University | Learning emotional state identification method based on multi-source information fusion |
CN107180236A (en) * | 2017-06-02 | 2017-09-19 | Beijing University of Technology | Multi-modal emotion recognition method based on brain-like model |
CN107180236B (en) * | 2017-06-02 | 2020-02-11 | Beijing University of Technology | Multi-modal emotion recognition method based on brain-like model |
CN109254669A (en) * | 2017-07-12 | 2019-01-22 | Tencent Technology (Shenzhen) Co., Ltd. | Expression picture input method, device, electronic equipment and system |
CN109254669B (en) * | 2017-07-12 | 2022-05-10 | Tencent Technology (Shenzhen) Co., Ltd. | Expression picture input method and device, electronic equipment and system |
CN107943299B (en) * | 2017-12-07 | 2022-05-06 | Shanghai Xiaoi Robot Technology Co., Ltd. | Emotion presenting method and device, computer equipment and computer-readable storage medium |
CN107943299A (en) * | 2017-12-07 | 2018-04-20 | Shanghai Xiaoi Robot Technology Co., Ltd. | Emotion presenting method and device, computer equipment and computer-readable storage medium |
US11455472B2 (en) | 2017-12-07 | 2022-09-27 | Shanghai Xiaoi Robot Technology Co., Ltd. | Method, device and computer readable storage medium for presenting emotion |
CN109903392A (en) * | 2017-12-11 | 2019-06-18 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Augmented reality method and apparatus |
US11257293B2 (en) | 2017-12-11 | 2022-02-22 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Augmented reality method and device fusing image-based target state data and sound-based target state data |
CN109903392B (en) * | 2017-12-11 | 2021-12-31 | Beijing Jingdong Shangke Information Technology Co., Ltd. | Augmented reality method and apparatus |
CN108091323B (en) * | 2017-12-19 | 2020-10-13 | Imagination Technology (Beijing) Co., Ltd. | Method and apparatus for emotion recognition from speech |
CN108091323A (en) * | 2017-12-19 | 2018-05-29 | Imagination Technology (Beijing) Co., Ltd. | Method and apparatus for recognizing emotion from speech |
CN108108849A (en) * | 2017-12-31 | 2018-06-01 | Xiamen University | Microblog emotion prediction method based on weakly supervised multi-modal deep learning |
CN110085211B (en) * | 2018-01-26 | 2021-06-29 | Shanghai Xiaoi Robot Technology Co., Ltd. | Voice recognition interaction method and device, computer equipment and storage medium |
CN108899050A (en) * | 2018-06-14 | 2018-11-27 | Nanjing Yunsi Chuangzhi Information Technology Co., Ltd. | Speech signal analysis subsystem based on a multi-modal emotion recognition system |
CN109241912B (en) * | 2018-09-08 | 2020-08-07 | Henan University | Target identification method based on brain-like cross-media intelligence for unmanned autonomous systems |
CN109241912A (en) * | 2018-09-08 | 2019-01-18 | Henan University | Target identification method based on brain-like cross-media intelligence for unmanned autonomous systems |
EP3627304A1 (en) * | 2018-09-20 | 2020-03-25 | XRSpace CO., LTD. | Interactive responding method and computer system using the same |
CN109829363A (en) * | 2018-12-18 | 2019-05-31 | Shenzhen OneConnect Smart Technology Co., Ltd. | Expression recognition method, device, computer equipment and storage medium |
CN110033029A (en) * | 2019-03-22 | 2019-07-19 | Wuyi University | Emotion recognition method and device based on multi-modal emotion model |
CN110287912A (en) * | 2019-06-28 | 2019-09-27 | Guangdong University of Technology | Method, apparatus and medium for determining target object emotional state based on deep learning |
CN110688499A (en) * | 2019-08-13 | 2020-01-14 | Shenzhen OneConnect Smart Technology Co., Ltd. | Data processing method, data processing device, computer equipment and storage medium |
CN110910903A (en) * | 2019-12-04 | 2020-03-24 | Shenzhen Qianhai WeBank Co., Ltd. | Speech emotion recognition method, device, equipment and computer-readable storage medium |
CN110910903B (en) * | 2019-12-04 | 2023-03-21 | Shenzhen Qianhai WeBank Co., Ltd. | Speech emotion recognition method, device, equipment and computer-readable storage medium |
US20210192332A1 (en) * | 2019-12-19 | 2021-06-24 | Sling Media Pvt Ltd | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
US11630999B2 (en) * | 2019-12-19 | 2023-04-18 | Dish Network Technologies India Private Limited | Method and system for analyzing customer calls by implementing a machine learning model to identify emotions |
CN113128284A (en) * | 2019-12-31 | 2021-07-16 | SAIC Motor Corporation Limited | Multi-modal emotion recognition method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106503646B (en) | 2020-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106503646A (en) | Multi-modal emotion identification system and method | |
Harwath et al. | Learning word-like units from joint audio-visual analysis | |
US20180018974A1 (en) | System and method for detecting tantrums | |
CN105632501B (en) | Automatic accent classification method and device based on deep learning technology |
EP3198589B1 (en) | Method and apparatus to synthesize voice based on facial structures | |
CN106205633B (en) | Imitation and performance practice scoring system |
CN106297826A (en) | Speech emotion recognition system and method |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
Fulmare et al. | Understanding and estimation of emotional expression using acoustic analysis of natural speech | |
CN108446278B (en) | Semantic understanding system and method based on natural language |
KR102607373B1 (en) | Apparatus and method for recognizing emotion in speech | |
CN106960181A (en) | Pedestrian attribute recognition method based on RGBD data |
KR20200105446A (en) | Apparatus and method for recognizing emotions | |
CN106489148A (en) | Intention scene recognition method and system based on user profiling |
CN107886968A (en) | Speech evaluation method and system |
CN112418172A (en) | Multimodal information fusion emotion analysis method based on a multimodal information intelligent processing unit |
Jazouli et al. | Automatic detection of stereotyped movements in autistic children using the Kinect sensor | |
CN118098587A (en) | AI suicide risk analysis method and system based on digital doctor | |
JP5180116B2 (en) | Nationality determination device, method and program | |
CN107452370A (en) | Application method of a judgment device for patients with dysphonia of Chinese nasal finals (vowels followed by a nasal consonant) |
CN108305629B (en) | Scene learning content acquisition method and device, learning equipment and storage medium | |
Amin et al. | HMM based automatic Arabic sign language translator using Kinect | |
CN112489787A (en) | Method for detecting human health status based on micro-expressions |
US10971148B2 (en) | Information providing device, information providing method, and recording medium for presenting words extracted from different word groups | |
Li et al. | Interpreting sign components from accelerometer and sEMG data for automatic sign language recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||