CN106503646B - Multi-mode emotion recognition system and method - Google Patents
- Publication number: CN106503646B (application number CN201610912302.5A)
- Authority: CN (China)
- Prior art keywords: emotion, emotion recognition, recognition result, voice signal, visual image
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/161: Human faces — detection, localisation, normalisation
- G06F18/24: Pattern recognition — classification techniques
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates
- G06V40/171: Local facial features and components; occluding parts, e.g. glasses; geometrical relationships
- G06V40/172: Face classification, e.g. identification
- G06V40/174: Facial expression recognition
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/1807: Speech classification or search using natural language modelling, using prosody or stress
- G10L15/1822: Parsing for meaning understanding
- G10L15/26: Speech-to-text systems
Landscapes
- Engineering & Computer Science; Health & Medical Sciences; Physics & Mathematics; Theoretical Computer Science; Multimedia; Human Computer Interaction; General Health & Medical Sciences; General Physics & Mathematics; Computational Linguistics; Audiology, Speech & Language Pathology; Oral & Maxillofacial Surgery; Artificial Intelligence; Computer Vision & Pattern Recognition; Acoustics & Sound; General Engineering & Computer Science; Data Mining & Analysis; Evolutionary Computation; Evolutionary Biology; Bioinformatics & Computational Biology; Bioinformatics & Cheminformatics; Psychiatry; Social Psychology; Life Sciences & Earth Sciences; User Interface Of Digital Computer
Abstract
The invention provides a multi-modal emotion recognition system and method. The system comprises a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion outputter. The voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object. The first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Description
Technical Field
The invention relates to computer processing technology, and in particular to a multi-modal emotion recognition system and method.
Background
At present, an emotion recognition machine usually recognizes human emotion using only one of text recognition, voice recognition, or visual image recognition. This single-channel approach draws on a small amount of information and makes it difficult to recognize human emotion in complex situations.
Disclosure of Invention
The invention provides a multi-modal emotion recognition system and method that combine text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
in one aspect, the present invention provides a multi-modal emotion recognition system, including a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion outputter. The voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object. The first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Further, the first emotion recognition subsystem specifically comprises an emotion saliency divider and a first emotion recognizer. The emotion saliency divider extracts acoustic prosodic features from the voice signal of the voice receiver; the first emotion recognizer obtains a first emotion recognition result of the voice signal from the acoustic prosodic features. The second emotion recognition subsystem specifically comprises a speech recognizer, a sentence feature value extractor, and a second emotion recognizer. The speech recognizer converts the voice signal of the voice receiver into a word sequence; the sentence feature value extractor extracts sentence feature values from the word sequence; the second emotion recognizer obtains a second emotion recognition result of the voice signal from the sentence feature values. The third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor, and a third emotion recognizer. The face recognition tracker recognizes and tracks face data in the visual image data; the human body recognition tracker recognizes and tracks whole-body data, including the head, in the visual image data. The facial expression feature extractor extracts facial key points from the face data and obtains facial expression feature values from them; the body action feature extractor extracts body action key points from the human body data and obtains body action feature values from them. The third emotion recognizer obtains a third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values; and the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
Further, the first emotion recognizer substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like voice features, and substitutes the neural-like voice features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to that emotion.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
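The patent does not show how these features are computed. As a rough illustration (not the patent's own extractor), a few of the listed quantities — root-mean-square intensity, zero-crossing rate, spectral centroid, and spectral flatness — can be computed for a single audio frame with standard signal processing, sketched here in NumPy:

```python
import numpy as np

def acoustic_features(signal: np.ndarray, sr: int) -> dict:
    """Compute a handful of the listed acoustic-prosodic features for one frame."""
    # Root-mean-square intensity
    rms = float(np.sqrt(np.mean(signal ** 2)))
    # Zero-crossing rate: fraction of adjacent-sample sign changes
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))
    # One-sided magnitude spectrum and its frequency axis
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    # Spectral flatness: geometric mean / arithmetic mean of the power spectrum
    power = spectrum ** 2 + 1e-12
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    return {"rms": rms, "zcr": zcr, "centroid": centroid, "flatness": flatness}

# A one-second 440 Hz sine at 16 kHz: the centroid should sit near 440 Hz
# and the flatness near zero (a pure tone is maximally "peaky").
sr = 16000
t = np.arange(sr) / sr
feats = acoustic_features(np.sin(2 * np.pi * 440 * t), sr)
```

Features such as "spectral quotient" or "voice climbing points" would need the patent's own definitions, which are not given here.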
Further, the sentence feature value extractor extracts the sentence feature values from the word sequence, specifically by performing word segmentation on the word sequence to obtain word segmentation feature values, performing word-category analysis to obtain word-category feature values, and performing sentence-pattern syntactic analysis to obtain sentence-pattern syntactic feature values. The second emotion recognizer inputs the word segmentation, word-category, and sentence-pattern syntactic feature values into a pre-constructed text emotion recognition model to obtain a second emotion of the voice signal and a second emotion recognition confidence corresponding to that emotion.
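The text emotion recognition model and its training data are not disclosed. A toy stand-in can still show the three kinds of sentence features the claim names — segmentation, word categories, and a sentence-pattern cue — feeding a lexicon-based emotion decision. The lexicon and rules below are entirely hypothetical:

```python
# Hypothetical emotion lexicon and negation list, for illustration only.
EMOTION_LEXICON = {
    "happy": "joy", "great": "joy", "love": "joy",
    "sad": "sadness", "miss": "sadness",
    "angry": "anger", "hate": "anger",
}
NEGATIONS = {"not", "never", "no"}

def sentence_features(text: str) -> dict:
    """Word segmentation, word-category tags, and a crude sentence-pattern cue."""
    tokens = text.lower().strip("?!. ").split()            # word segmentation
    categories = ["emotion" if t in EMOTION_LEXICON else
                  "negation" if t in NEGATIONS else "other" for t in tokens]
    sentence_type = ("question" if text.rstrip().endswith("?")
                     else "exclamation" if text.rstrip().endswith("!")
                     else "statement")                     # sentence-pattern cue
    return {"tokens": tokens, "categories": categories, "type": sentence_type}

def text_emotion(text: str) -> tuple:
    """Return (emotion, confidence) from lexicon hits, flipping joy under negation."""
    f = sentence_features(text)
    hits = [EMOTION_LEXICON[t] for t in f["tokens"] if t in EMOTION_LEXICON]
    if not hits:
        return ("neutral", 0.0)
    emotion = hits[0]
    if "negation" in f["categories"] and emotion == "joy":
        emotion = "sadness"                                # "not happy" -> sadness
    confidence = len(hits) / len(f["tokens"])
    return (emotion, confidence)
```

A real system would replace the lexicon lookup with a trained classifier; the point here is only that all three feature types flow into one decision.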
Further, the third emotion recognizer substitutes the facial expression feature values and the body action feature values into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to that emotion.
Further, the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results and a pre-constructed psychology and behavior mapping relation map, as follows. When any one of the first, second, or third emotion recognition confidences is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When all three confidences are below the threshold, emotion labels are computed for the first, second, and third emotions according to a preset weighting rule, yielding a first, second, and third emotion label; the emotional state of the target object is then determined from these labels and the pre-constructed psychology and behavior mapping relation map.
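The decision rule above can be sketched directly. The threshold, channel weights, and behavior-map contents below are illustrative placeholders — the patent does not disclose its actual values:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Recognition:
    emotion: str        # e.g. "joy", "anger"
    confidence: float   # in [0, 1]

# Illustrative threshold and per-channel weights (not from the patent).
THRESHOLD = 0.8
WEIGHTS = {"prosody": 0.3, "text": 0.3, "vision": 0.4}

def fuse(results: Dict[str, Recognition],
         behavior_map: Optional[Dict[str, str]] = None) -> str:
    """One sufficiently confident channel decides alone; otherwise the labels
    are weight-voted and the winner is looked up in the behavior map."""
    # Step 1: any channel at or above the threshold wins outright.
    for rec in results.values():
        if rec.confidence >= THRESHOLD:
            return rec.emotion
    # Step 2: otherwise accumulate confidence-scaled weights per emotion label.
    scores: Dict[str, float] = {}
    for channel, rec in results.items():
        scores[rec.emotion] = scores.get(rec.emotion, 0.0) + WEIGHTS[channel] * rec.confidence
    winner = max(scores, key=scores.get)
    # Step 3: map the behavioral label to an underlying emotional state.
    return behavior_map.get(winner, winner) if behavior_map else winner
```

For example, a vision result at confidence 0.9 decides immediately, while three sub-threshold results fall through to the weighted vote.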
In another aspect, the present invention provides a multi-modal emotion recognition method, in which the voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object; the first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data; and the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Further, the first emotion recognition subsystem obtains the first emotion recognition result by extracting acoustic prosodic features from the voice signal of the voice receiver and obtaining a first emotion recognition result of the voice signal from those features. The second emotion recognition subsystem obtains the second emotion recognition result by converting the voice signal of the voice receiver into a word sequence, extracting sentence feature values from the word sequence, and obtaining a second emotion recognition result of the voice signal from the sentence feature values. The third emotion recognition subsystem obtains the third emotion recognition result by recognizing and tracking face data in the visual image data; recognizing and tracking whole-body data, including the head, in the visual image data; extracting facial key points from the face data and obtaining facial expression feature values from them; extracting body action key points from the human body data and obtaining body action feature values from them; and obtaining a third emotion recognition result of the visual image data from the facial expression and body action feature values. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
The multi-modal emotion recognition system and method provided by the invention integrate text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
Drawings
FIG. 1 is a block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 2 is another block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention;
FIG. 4 is another flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples, which, however, are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever.
Example one
With reference to fig. 1, the multi-modal emotion recognition system provided in this embodiment includes: the system comprises a voice receiver 1, a first emotion recognition subsystem 3, a second emotion recognition subsystem 4, a visual image receiver 2, a third emotion recognition subsystem 5 and an emotion output device 6; a voice receiver 1 for receiving a voice signal emitted from a target object; a visual image receiver 2 for receiving visual image data on a target object; the first emotion recognition subsystem 3 is used for acquiring a first emotion recognition result according to the voice signal; the second emotion recognition subsystem 4 is used for acquiring a second emotion recognition result according to the voice signal; the third emotion recognition subsystem 5 is used for acquiring a third emotion recognition result according to the visual image data; and the emotion outputter 6 is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Preferably, as shown in fig. 2, the first emotion recognition subsystem 3 specifically includes an emotion saliency divider 301 and a first emotion recognizer 302. The emotion saliency divider 301 extracts acoustic prosodic features from the speech signal of the speech receiver 1; the first emotion recognizer 302 obtains a first emotion recognition result of the speech signal from the acoustic prosodic features. The second emotion recognition subsystem 4 specifically includes a speech recognizer 401, a sentence feature value extractor 402, and a second emotion recognizer 403. The speech recognizer 401 converts the speech signal of the speech receiver 1 into a word sequence; the sentence feature value extractor 402 extracts sentence feature values from the word sequence; the second emotion recognizer 403 obtains a second emotion recognition result of the speech signal from the sentence feature values. The third emotion recognition subsystem 5 specifically includes a face recognition tracker 501, a human body recognition tracker 503, a facial expression feature extractor 502, a body action feature extractor 504, and a third emotion recognizer 505. The face recognition tracker 501 recognizes and tracks face data in the visual image data; the human body recognition tracker 503 recognizes and tracks whole-body data, including the head, in the visual image data. The facial expression feature extractor 502 extracts facial key points from the face data and obtains facial expression feature values from them; the body action feature extractor 504 extracts body action key points from the human body data and obtains body action feature values from them. The third emotion recognizer 505 obtains a third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values; and the emotion outputter 6 determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition system provided by this embodiment integrates text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed from behavioral-psychology findings; it is essentially a mapping from human behavioral manifestations to underlying human emotions.
Further preferably, the first emotion recognizer 302 obtains the first emotion recognition result of the speech signal from the acoustic prosodic features as follows: it substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like speech features, and substitutes the neural-like speech features into a pre-stored emotion model to obtain the first emotion of the speech signal and the first emotion recognition confidence corresponding to that emotion.
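Neither the "brain-like" model nor the stored emotion model is specified in the text, but the two-stage shape of the claim — prosodic features, then a neural-like intermediate representation, then an emotion plus a confidence — can be sketched generically. The weights below are random and untrained, purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stands in for the "brain-like" feature transform; stage 2 for the
# pre-stored emotion model. Random weights are placeholders for illustration.
W1 = rng.normal(size=(22, 8))    # ~22 prosodic features -> 8 latent features
W2 = rng.normal(size=(8, 4))     # 8 latent features -> 4 emotion scores
EMOTIONS = ["joy", "sadness", "anger", "neutral"]

def recognize(prosodic: np.ndarray) -> tuple:
    """Map a prosodic feature vector to (emotion, confidence)."""
    latent = np.tanh(prosodic @ W1)                   # "neural-like" features
    scores = latent @ W2
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax -> confidences
    i = int(np.argmax(probs))
    return EMOTIONS[i], float(probs[i])

emotion, conf = recognize(rng.normal(size=22))
```

In the patent's terms, `latent` corresponds to the neural-like speech features and the softmax maximum to the first emotion recognition confidence.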
Specifically, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
Further preferably, the sentence feature value extractor 402 extracts the sentence feature values from the word sequence, specifically by performing word segmentation on the word sequence to obtain word segmentation feature values, word-category analysis to obtain word-category feature values, and sentence-pattern syntactic analysis to obtain sentence-pattern syntactic feature values. The second emotion recognizer 403 then inputs the word segmentation, word-category, and sentence-pattern syntactic feature values into a pre-constructed text emotion recognition model to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to that emotion.
Further preferably, the third emotion recognizer 505 substitutes the facial expression feature values and the body action feature values into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to that emotion.
In this embodiment, facial expressions are combined with body movements when performing emotion recognition on the visual image data, which improves the recognition rate. For example, when a person puffs out the chest, raises the chin, and wears a smile, the corresponding emotion is pride; but from a raised chest and chin alone, or a smile alone, pride cannot be inferred. In addition, this embodiment draws on the psychological research results of Paul Ekman and uses a deep learning model over facial expressions and body movements to distinguish human emotions.
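The pride example can be made concrete with a rule-based toy classifier over hypothetical binary cues (the patent's actual classifier is a trained deep model over keypoint features, not rules like these):

```python
# Hypothetical binary cues derived from face and body keypoints.
def visual_emotion(smile: bool, chest_out: bool, chin_up: bool) -> tuple:
    """Pride requires BOTH the facial cue and the postural cues; either alone
    is ambiguous, which is the argument for fusing face and body channels."""
    if smile and chest_out and chin_up:
        return ("pride", 0.9)       # all cues agree -> confident
    if smile:
        return ("joy", 0.6)         # a smile alone could be joy, not pride
    if chest_out and chin_up:
        return ("confidence", 0.5)  # posture alone is weak evidence
    return ("neutral", 0.3)
```

The single-channel cases deliberately return lower confidences, mirroring the embodiment's claim that neither channel alone can establish pride.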
Further preferably, the emotion outputter 6 determines the emotional state of the target object from the first, second, and third emotion recognition results and the pre-constructed psychology and behavior mapping relation map as follows. When any one of the first, second, or third emotion recognition confidences is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When all three confidences are below the threshold, emotion labels are computed for the first, second, and third emotions according to a preset weighting rule, yielding a first, second, and third emotion label; the emotional state of the target object is then determined from these labels and the pre-constructed psychology and behavior mapping relation map.
Example two
With reference to fig. 3, an embodiment of the present invention provides a multi-modal emotion recognition method, including:
step S1: the voice receiver 1 receives a voice signal sent by a target object;
step S2: the visual image receiver 2 receives visual image data on a target object;
step S3: the first emotion recognition subsystem 3 acquires a first emotion recognition result according to the voice signal;
step S4: the second emotion recognition subsystem 4 acquires a second emotion recognition result according to the voice signal;
step S5: the third emotion recognition subsystem 5 acquires a third emotion recognition result according to the visual image data;
step S6: and the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Preferably, as shown in fig. 4, the first emotion recognition subsystem 3 obtains a first emotion recognition result according to the voice signal, specifically including,
step S3.1: extracting acoustic prosodic features from a speech signal of the speech receiver 1;
step S3.2: acquiring a first emotion recognition result of the voice signal according to the acoustic rhythm characteristics;
the second emotion recognition subsystem 4 obtains a second emotion recognition result according to the voice signal, which specifically includes,
step S4.1: converting the speech signal of the speech receiver 1 into a sequence of words;
step S4.2: extracting a sentence characteristic value in the character sequence;
step S4.3: acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value;
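The text-side steps S4.1 to S4.3 (word segmentation, word-category analysis, sentence-pattern analysis, then classification) can be sketched as follows. The whitespace tokenizer, sentiment lexicons, and scoring are toy assumptions standing in for a real speech recognizer and the patent's text emotion recognition model:

```python
# Toy lexicons; a real system would use trained models, not word lists.
POSITIVE = {"great", "happy", "love"}
NEGATIVE = {"awful", "sad", "hate"}

def sentence_features(text):
    """S4.2: derive segmentation, word-category, and sentence-pattern
    feature values from a recognized word sequence."""
    tokens = text.lower().rstrip("?!.").split()        # word segmentation
    categories = ["pos" if t in POSITIVE
                  else "neg" if t in NEGATIVE
                  else "neutral" for t in tokens]      # word-category analysis
    pattern = ("question" if text.endswith("?")
               else "exclamation" if text.endswith("!")
               else "statement")                       # sentence-pattern analysis
    return tokens, categories, pattern

def text_emotion(text):
    """S4.3: map the feature values to an (emotion, confidence) pair."""
    tokens, categories, pattern = sentence_features(text)
    score = categories.count("pos") - categories.count("neg")
    emotion = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    confidence = min(1.0, abs(score) / max(len(tokens), 1)
                     + (0.1 if pattern == "exclamation" else 0.0))
    return emotion, confidence
```

The three feature values (tokens, categories, pattern) correspond to the segmentation, word-category, and sentence-pattern feature values named later in the claims.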
the third emotion recognition subsystem 5 obtains a third emotion recognition result according to the visual image data, and specifically includes,
step S5.1: recognizing and tracking face data in the visual image data;
step S5.2: recognizing and tracking whole-body data, including the head, in the visual image data;
step S5.3: extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
step S5.4: extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
step S5.5: acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
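A geometric sketch of how keypoints might be turned into the facial expression and body motion feature values of steps S5.3 to S5.5. The keypoint names and the smile/raised-arms heuristics are illustrative assumptions, not the patent's classifier:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) keypoints."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def facial_expression_value(face_kp):
    """S5.3: mouth width relative to eye distance; larger when smiling."""
    mouth = dist(face_kp["mouth_left"], face_kp["mouth_right"])
    eyes = dist(face_kp["eye_left"], face_kp["eye_right"])
    return mouth / eyes

def body_motion_value(body_kp):
    """S5.4: positive when the wrists sit above the shoulders
    (raised arms); note that image y grows downward."""
    shoulder_y = (body_kp["shoulder_left"][1] + body_kp["shoulder_right"][1]) / 2
    wrist_y = (body_kp["wrist_left"][1] + body_kp["wrist_right"][1]) / 2
    return shoulder_y - wrist_y

def visual_emotion(face_kp, body_kp):
    """S5.5: classify from the two feature values (toy thresholds)."""
    smiling = facial_expression_value(face_kp) > 1.2
    excited = body_motion_value(body_kp) > 0
    if smiling and excited:
        return "joy", 0.9
    if smiling:
        return "contentment", 0.7
    return "neutral", 0.5
```

In practice the keypoints would come from face and pose detectors, and the thresholds would be replaced by a trained emotion classifier as the text describes.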
the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically includes,
step S6.1: determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition method provided by this embodiment of the invention integrates text recognition, speech recognition and visual image recognition technologies and performs human emotion recognition over several channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed according to behavioral psychology relations; it is essentially a map from a person's behavioral expression to the person's real emotion.
Specifically, the acoustic prosodic features include pitch, intensity, timbre, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square energy, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral quotient, spectral flatness, spectral tilt, spectral sharpness, chroma, spectral roll-off point, spectral slope, harmonics, voicing probability, formants, speech onset points and spectral envelope.
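Two of the listed features, zero-crossing rate and spectral centroid, have standard definitions and can be computed directly from raw samples. A pure-Python sketch; production code would use a DSP library and frame-wise analysis:

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency, via a naive DFT
    over the first half of the spectrum."""
    n = len(samples)
    mags, freqs = [], []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n)
                 for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        mags.append(math.hypot(re, im))
        freqs.append(k * sample_rate / n)
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0
```

For a pure tone the centroid lands on the tone's frequency, and the zero-crossing rate reflects twice the number of cycles per sample.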
Although the present invention has been described to a certain extent, it will be apparent that appropriate changes to the respective conditions may be made without departing from its spirit and scope. It is to be understood that the invention is not limited to the described embodiments, but is to be accorded the full scope of the claims, including equivalents of each element described.
Claims (5)
1. A multi-modal emotion recognition system, comprising: a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem and an emotion outputter;
the voice receiver is used for receiving a voice signal sent by a target object;
the visual image receiver is used for receiving visual image data about the target object;
the first emotion recognition subsystem is used for acquiring a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem is used for acquiring a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem is used for acquiring a third emotion recognition result according to the visual image data;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem specifically comprises an emotion significance segmenter and a first emotion recognizer;
the emotion significance segmenter is used for extracting acoustic prosodic features from the voice signal of the voice receiver;
the first emotion recognizer is used for acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem specifically comprises a voice recognizer, a sentence characteristic value extractor and a second emotion recognizer;
the voice recognizer is used for converting the voice signal of the voice receiver into a character sequence;
the sentence characteristic value extractor is used for extracting the sentence characteristic values in the character sequence;
the second emotion recognizer is used for acquiring a second emotion recognition result of the voice signal according to the statement feature value;
the third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor and a third emotion recognizer;
the face recognition tracker is used for recognizing and tracking face data in the visual image data;
the human body recognition tracker is used for recognizing and tracking whole-body data, including the head, in the visual image data;
the facial expression feature extractor is used for extracting facial key points in the face data and acquiring facial expression feature values according to the facial key points;
the body motion characteristic extractor is used for extracting body motion key points in the human body data and acquiring body motion characteristic values according to the body motion key points;
the third emotion recognizer is used for acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body motion characteristic value;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, wherein the psychology and behavior mapping relation map is a relation library pre-constructed according to behavioral psychology relations and is essentially a map from a person's behavioral expression to the person's real emotion;
the sentence characteristic value extractor extracts the sentence characteristic values in the character sequence, specifically comprising,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
the second emotion recognizer obtains the second emotion recognition result of the voice signal according to the sentence characteristic value, which specifically comprises,
the second emotion recognizer inputs the word segmentation feature values, the word category feature values and the sentence pattern syntactic feature values in the sentence feature values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
the first emotion recognizer obtains the first emotion recognition result of the voice signal according to the acoustic prosodic feature, and specifically comprises,
the first emotion recognizer substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to acquire neural-like voice features, and substitutes the neural-like voice features into a pre-stored emotion model to acquire a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion.
2. The system of claim 1, wherein the acoustic prosodic features include pitch, intensity, timbre, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square energy, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral quotient, spectral flatness, spectral tilt, spectral sharpness, chroma, spectral roll-off point, spectral slope, harmonics, voicing probability, formants, speech onset points and spectral envelope.
3. The multi-modal emotion recognition system of claim 1, wherein the third emotion recognizer obtains a third emotion recognition result of the visual image data based on the facial expression feature value and the body motion feature value, and in particular comprises,
the third emotion recognizer substitutes the facial expression characteristic value and the body action characteristic value into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
4. The multi-modal emotion recognition system of any of claims 1 to 3, wherein the emotion outputter determines the emotional state of the target object based on the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, and specifically comprises,
determining the emotion corresponding to that emotion recognition confidence as the emotional state of the target object when any one of the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold;
when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result are all smaller than the set threshold, respectively calculating emotion labels according to a preset weight rule for the first emotion of the first emotion recognition result, the second emotion of the second emotion recognition result and the third emotion of the third emotion recognition result, to obtain a first emotion label, a second emotion label and a third emotion label;
and determining the emotional state of the target object according to the first emotion label, the second emotion label, the third emotion label and the pre-constructed psychology and behavior mapping relation map.
5. A multi-modal emotion recognition method, comprising:
the voice receiver receives a voice signal sent by a target object;
a visual image receiver receiving visual image data about the target object;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal, and specifically comprises,
extracting acoustic prosodic features from the speech signal of the speech receiver;
acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal, and specifically comprises,
converting the speech signal of the speech receiver into a sequence of words;
extracting a sentence characteristic value in the character sequence;
acquiring the second emotion recognition result of the voice signal according to the statement characteristic value;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data, and specifically comprises,
recognizing and tracking face data in the visual image data;
recognizing and tracking whole-body data, including the head, in the visual image data;
extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically comprises,
determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation atlas; the psychology and behavior mapping relation map is a relation library constructed in advance according to behavior and psychology relations, and the essence of the psychology and behavior mapping relation map is a mapping relation map from behavior appearance of a person to real emotion of the person;
the extracting of the sentence characteristic values in the text sequence specifically comprises,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
obtaining the second emotion recognition result of the speech signal according to the sentence characteristic value, specifically including,
inputting the word segmentation characteristic values, the word category characteristic values and the sentence pattern syntactic characteristic values in the sentence characteristic values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
obtaining the first emotion recognition result of the voice signal according to the acoustic prosody feature, specifically comprising,
substituting the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain nerve-like voice features, and substituting the nerve-like voice features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion;
the acoustic rhythm characteristics comprise pitch, intensity, tone quality, sound spectrum, cepstrum, linear perception prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, frequency spectrum flow, frequency spectrum mass center, frequency bandwidth, frequency spectrum quotient, frequency spectrum flatness, frequency spectrum inclination, frequency spectrum sharpness, sound chromaticity, frequency spectrum attenuation point, frequency spectrum inclination, single-frequency overtone, sound probability, sound formant, voice climbing point and frequency spectrum envelope.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610912302.5A CN106503646B (en) | 2016-10-19 | 2016-10-19 | Multi-mode emotion recognition system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106503646A CN106503646A (en) | 2017-03-15 |
CN106503646B true CN106503646B (en) | 2020-07-10 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12080285B2 (en) | 2020-04-22 | 2024-09-03 | Google Llc | Semi-delegated calling by an automated assistant on behalf of human participant |