CN106503646B - Multi-mode emotion recognition system and method - Google Patents
- Publication number: CN106503646B (application number CN201610912302.5A)
- Authority: CN (China)
- Prior art keywords: emotion, emotion recognition, recognition result, voice signal, visual image
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/161: Human faces — detection, localisation, normalisation
- G06F18/24: Pattern recognition — classification techniques
- G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/284: Lexical analysis, e.g. tokenisation or collocates
- G06V40/171: Local facial features and components; occluding parts, e.g. glasses; geometrical relationships
- G06V40/172: Face classification, e.g. identification
- G06V40/174: Facial expression recognition
- G06V40/20: Movements or behaviour, e.g. gesture recognition
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/1807: Speech classification or search using natural language modelling, using prosody or stress
- G10L15/1822: Parsing for meaning understanding
- G10L15/26: Speech-to-text systems
Landscapes
- Engineering & Computer Science; Health & Medical Sciences; Physics & Mathematics; Theoretical Computer Science; Multimedia; Human Computer Interaction; General Health & Medical Sciences; General Physics & Mathematics; Computational Linguistics; Audiology, Speech & Language Pathology; Oral & Maxillofacial Surgery; Artificial Intelligence; Computer Vision & Pattern Recognition; Acoustics & Sound; General Engineering & Computer Science; Data Mining & Analysis; Evolutionary Computation; Evolutionary Biology; Bioinformatics & Computational Biology; Bioinformatics & Cheminformatics; Psychiatry; Social Psychology; Life Sciences & Earth Sciences; User Interface Of Digital Computer
Abstract
The invention provides a multi-modal emotion recognition system and method. The system comprises a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion outputter. The voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object. The first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Description
Technical Field
The invention relates to computer processing technology, and in particular to a multi-modal emotion recognition system and method.
Background
At present, an emotion recognition machine usually recognizes human emotion using only one of text recognition, voice recognition, or visual image recognition. This single-channel approach draws on a small amount of information and makes it difficult to recognize human emotion in complex situations.
Disclosure of Invention
The invention provides a multi-modal emotion recognition system and method that combine text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
in one aspect, the present invention provides a multi-modal emotion recognition system, including a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem, and an emotion outputter. The voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object. The first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Further, the first emotion recognition subsystem specifically comprises an emotion saliency divider and a first emotion recognizer. The emotion saliency divider extracts acoustic prosodic features from the voice signal of the voice receiver; the first emotion recognizer obtains a first emotion recognition result of the voice signal from the acoustic prosodic features. The second emotion recognition subsystem specifically comprises a speech recognizer, a sentence feature value extractor, and a second emotion recognizer. The speech recognizer converts the voice signal of the voice receiver into a word sequence; the sentence feature value extractor extracts sentence feature values from the word sequence; the second emotion recognizer obtains a second emotion recognition result of the voice signal from the sentence feature values. The third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor, and a third emotion recognizer. The face recognition tracker recognizes and tracks face data in the visual image data; the human body recognition tracker recognizes and tracks whole-body data, including the head, in the visual image data. The facial expression feature extractor extracts facial key points from the face data and obtains facial expression feature values from them; the body action feature extractor extracts body action key points from the human body data and obtains body action feature values from them. The third emotion recognizer obtains a third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values; and the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
Further, the first emotion recognizer substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like voice features, and substitutes the neural-like voice features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to that emotion.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
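The patent does not show how these features are computed. As a rough illustration (not the patent's own extractor), a few of the listed quantities — root-mean-square intensity, zero-crossing rate, spectral centroid, and spectral flatness — can be computed for a single audio frame with standard signal processing, sketched here in NumPy:

```python
import numpy as np

def acoustic_features(signal: np.ndarray, sr: int) -> dict:
    """Compute a handful of the listed acoustic-prosodic features for one frame."""
    # Root-mean-square intensity
    rms = float(np.sqrt(np.mean(signal ** 2)))
    # Zero-crossing rate: fraction of adjacent-sample sign changes
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))
    # One-sided magnitude spectrum and its frequency axis
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    # Spectral centroid: magnitude-weighted mean frequency
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12))
    # Spectral flatness: geometric mean / arithmetic mean of the power spectrum
    power = spectrum ** 2 + 1e-12
    flatness = float(np.exp(np.mean(np.log(power))) / np.mean(power))
    return {"rms": rms, "zcr": zcr, "centroid": centroid, "flatness": flatness}

# A one-second 440 Hz sine at 16 kHz: the centroid should sit near 440 Hz
# and the flatness near zero (a pure tone is maximally "peaky").
sr = 16000
t = np.arange(sr) / sr
feats = acoustic_features(np.sin(2 * np.pi * 440 * t), sr)
```

Features such as "spectral quotient" or "voice climbing points" would need the patent's own definitions, which are not given here.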
Further, the sentence feature value extractor extracts the sentence feature values from the word sequence, specifically by performing word segmentation on the word sequence to obtain word segmentation feature values, performing word-category analysis to obtain word-category feature values, and performing sentence-pattern syntactic analysis to obtain sentence-pattern syntactic feature values. The second emotion recognizer inputs the word segmentation, word-category, and sentence-pattern syntactic feature values into a pre-constructed text emotion recognition model to obtain a second emotion of the voice signal and a second emotion recognition confidence corresponding to that emotion.
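The text emotion recognition model and its training data are not disclosed. A toy stand-in can still show the three kinds of sentence features the claim names — segmentation, word categories, and a sentence-pattern cue — feeding a lexicon-based emotion decision. The lexicon and rules below are entirely hypothetical:

```python
# Hypothetical emotion lexicon and negation list, for illustration only.
EMOTION_LEXICON = {
    "happy": "joy", "great": "joy", "love": "joy",
    "sad": "sadness", "miss": "sadness",
    "angry": "anger", "hate": "anger",
}
NEGATIONS = {"not", "never", "no"}

def sentence_features(text: str) -> dict:
    """Word segmentation, word-category tags, and a crude sentence-pattern cue."""
    tokens = text.lower().strip("?!. ").split()            # word segmentation
    categories = ["emotion" if t in EMOTION_LEXICON else
                  "negation" if t in NEGATIONS else "other" for t in tokens]
    sentence_type = ("question" if text.rstrip().endswith("?")
                     else "exclamation" if text.rstrip().endswith("!")
                     else "statement")                     # sentence-pattern cue
    return {"tokens": tokens, "categories": categories, "type": sentence_type}

def text_emotion(text: str) -> tuple:
    """Return (emotion, confidence) from lexicon hits, flipping joy under negation."""
    f = sentence_features(text)
    hits = [EMOTION_LEXICON[t] for t in f["tokens"] if t in EMOTION_LEXICON]
    if not hits:
        return ("neutral", 0.0)
    emotion = hits[0]
    if "negation" in f["categories"] and emotion == "joy":
        emotion = "sadness"                                # "not happy" -> sadness
    confidence = len(hits) / len(f["tokens"])
    return (emotion, confidence)
```

A real system would replace the lexicon lookup with a trained classifier; the point here is only that all three feature types flow into one decision.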
Further, the third emotion recognizer substitutes the facial expression feature values and the body action feature values into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to that emotion.
Further, the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results and a pre-constructed psychology and behavior mapping relation map, as follows. When any one of the first, second, or third emotion recognition confidences is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When all three confidences are below the threshold, emotion labels are computed for the first, second, and third emotions according to a preset weighting rule, yielding a first, second, and third emotion label; the emotional state of the target object is then determined from these labels and the pre-constructed psychology and behavior mapping relation map.
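The decision rule above can be sketched directly. The threshold, channel weights, and behavior-map contents below are illustrative placeholders — the patent does not disclose its actual values:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Recognition:
    emotion: str        # e.g. "joy", "anger"
    confidence: float   # in [0, 1]

# Illustrative threshold and per-channel weights (not from the patent).
THRESHOLD = 0.8
WEIGHTS = {"prosody": 0.3, "text": 0.3, "vision": 0.4}

def fuse(results: Dict[str, Recognition],
         behavior_map: Optional[Dict[str, str]] = None) -> str:
    """One sufficiently confident channel decides alone; otherwise the labels
    are weight-voted and the winner is looked up in the behavior map."""
    # Step 1: any channel at or above the threshold wins outright.
    for rec in results.values():
        if rec.confidence >= THRESHOLD:
            return rec.emotion
    # Step 2: otherwise accumulate confidence-scaled weights per emotion label.
    scores: Dict[str, float] = {}
    for channel, rec in results.items():
        scores[rec.emotion] = scores.get(rec.emotion, 0.0) + WEIGHTS[channel] * rec.confidence
    winner = max(scores, key=scores.get)
    # Step 3: map the behavioral label to an underlying emotional state.
    return behavior_map.get(winner, winner) if behavior_map else winner
```

For example, a vision result at confidence 0.9 decides immediately, while three sub-threshold results fall through to the weighted vote.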
In another aspect, the present invention provides a multi-modal emotion recognition method, in which the voice receiver receives a voice signal sent by a target object; the visual image receiver receives visual image data about the target object; the first emotion recognition subsystem obtains a first emotion recognition result from the voice signal; the second emotion recognition subsystem obtains a second emotion recognition result from the voice signal; the third emotion recognition subsystem obtains a third emotion recognition result from the visual image data; and the emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results.
Further, the first emotion recognition subsystem obtains the first emotion recognition result by extracting acoustic prosodic features from the voice signal of the voice receiver and obtaining a first emotion recognition result of the voice signal from those features. The second emotion recognition subsystem obtains the second emotion recognition result by converting the voice signal of the voice receiver into a word sequence, extracting sentence feature values from the word sequence, and obtaining a second emotion recognition result of the voice signal from the sentence feature values. The third emotion recognition subsystem obtains the third emotion recognition result by recognizing and tracking face data in the visual image data; recognizing and tracking whole-body data, including the head, in the visual image data; extracting facial key points from the face data and obtaining facial expression feature values from them; extracting body action key points from the human body data and obtaining body action feature values from them; and obtaining a third emotion recognition result of the visual image data from the facial expression and body action feature values. The emotion outputter determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
The multi-modal emotion recognition system and method provided by the invention integrate text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
Drawings
FIG. 1 is a block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 2 is another block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention;
FIG. 4 is another flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples, which, however, are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever.
Example one
With reference to fig. 1, the multi-modal emotion recognition system provided in this embodiment includes: the system comprises a voice receiver 1, a first emotion recognition subsystem 3, a second emotion recognition subsystem 4, a visual image receiver 2, a third emotion recognition subsystem 5 and an emotion output device 6; a voice receiver 1 for receiving a voice signal emitted from a target object; a visual image receiver 2 for receiving visual image data on a target object; the first emotion recognition subsystem 3 is used for acquiring a first emotion recognition result according to the voice signal; the second emotion recognition subsystem 4 is used for acquiring a second emotion recognition result according to the voice signal; the third emotion recognition subsystem 5 is used for acquiring a third emotion recognition result according to the visual image data; and the emotion outputter 6 is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Preferably, as shown in fig. 2, the first emotion recognition subsystem 3 specifically includes an emotion saliency divider 301 and a first emotion recognizer 302. The emotion saliency divider 301 extracts acoustic prosodic features from the speech signal of the speech receiver 1; the first emotion recognizer 302 obtains a first emotion recognition result of the speech signal from the acoustic prosodic features. The second emotion recognition subsystem 4 specifically includes a speech recognizer 401, a sentence feature value extractor 402, and a second emotion recognizer 403. The speech recognizer 401 converts the speech signal of the speech receiver 1 into a word sequence; the sentence feature value extractor 402 extracts sentence feature values from the word sequence; the second emotion recognizer 403 obtains a second emotion recognition result of the speech signal from the sentence feature values. The third emotion recognition subsystem 5 specifically includes a face recognition tracker 501, a human body recognition tracker 503, a facial expression feature extractor 502, a body action feature extractor 504, and a third emotion recognizer 505. The face recognition tracker 501 recognizes and tracks face data in the visual image data; the human body recognition tracker 503 recognizes and tracks whole-body data, including the head, in the visual image data. The facial expression feature extractor 502 extracts facial key points from the face data and obtains facial expression feature values from them; the body action feature extractor 504 extracts body action key points from the human body data and obtains body action feature values from them. The third emotion recognizer 505 obtains a third emotion recognition result of the visual image data from the facial expression feature values and the body action feature values; and the emotion outputter 6 determines the emotional state of the target object from the first, second, and third emotion recognition results together with a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition system provided by this embodiment integrates text recognition, voice recognition, and visual image recognition, performing human emotion recognition over multiple channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed from behavioral-psychology findings; it is essentially a mapping from human behavioral manifestations to underlying human emotions.
Further preferably, the first emotion recognizer 302 obtains the first emotion recognition result of the speech signal from the acoustic prosodic features as follows: it substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like speech features, and substitutes the neural-like speech features into a pre-stored emotion model to obtain the first emotion of the speech signal and the first emotion recognition confidence corresponding to that emotion.
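Neither the "brain-like" model nor the stored emotion model is specified in the text, but the two-stage shape of the claim — prosodic features, then a neural-like intermediate representation, then an emotion plus a confidence — can be sketched generically. The weights below are random and untrained, purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 stands in for the "brain-like" feature transform; stage 2 for the
# pre-stored emotion model. Random weights are placeholders for illustration.
W1 = rng.normal(size=(22, 8))    # ~22 prosodic features -> 8 latent features
W2 = rng.normal(size=(8, 4))     # 8 latent features -> 4 emotion scores
EMOTIONS = ["joy", "sadness", "anger", "neutral"]

def recognize(prosodic: np.ndarray) -> tuple:
    """Map a prosodic feature vector to (emotion, confidence)."""
    latent = np.tanh(prosodic @ W1)                   # "neural-like" features
    scores = latent @ W2
    probs = np.exp(scores) / np.exp(scores).sum()     # softmax -> confidences
    i = int(np.argmax(probs))
    return EMOTIONS[i], float(probs[i])

emotion, conf = recognize(rng.normal(size=22))
```

In the patent's terms, `latent` corresponds to the neural-like speech features and the softmax maximum to the first emotion recognition confidence.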
Specifically, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero-crossing rate, spectral flux, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, chroma, spectral roll-off point, single-frequency overtones, voicing probability, formants, voice climbing points, and spectral envelope.
Further preferably, the sentence feature value extractor 402 extracts the sentence feature values from the word sequence, specifically by performing word segmentation on the word sequence to obtain word segmentation feature values, word-category analysis to obtain word-category feature values, and sentence-pattern syntactic analysis to obtain sentence-pattern syntactic feature values. The second emotion recognizer 403 then inputs the word segmentation, word-category, and sentence-pattern syntactic feature values into a pre-constructed text emotion recognition model to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to that emotion.
Further preferably, the third emotion recognizer 505 substitutes the facial expression feature values and the body action feature values into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to that emotion.
In this embodiment, facial expressions are combined with body movements when performing emotion recognition on the visual image data, which improves the recognition rate. For example, when a person puffs out the chest, raises the chin, and wears a smile, the corresponding emotion is pride; but from a raised chest and chin alone, or a smile alone, pride cannot be inferred. In addition, this embodiment draws on the psychological research results of Paul Ekman and uses a deep learning model over facial expressions and body movements to distinguish human emotions.
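The pride example can be made concrete with a rule-based toy classifier over hypothetical binary cues (the patent's actual classifier is a trained deep model over keypoint features, not rules like these):

```python
# Hypothetical binary cues derived from face and body keypoints.
def visual_emotion(smile: bool, chest_out: bool, chin_up: bool) -> tuple:
    """Pride requires BOTH the facial cue and the postural cues; either alone
    is ambiguous, which is the argument for fusing face and body channels."""
    if smile and chest_out and chin_up:
        return ("pride", 0.9)       # all cues agree -> confident
    if smile:
        return ("joy", 0.6)         # a smile alone could be joy, not pride
    if chest_out and chin_up:
        return ("confidence", 0.5)  # posture alone is weak evidence
    return ("neutral", 0.3)
```

The single-channel cases deliberately return lower confidences, mirroring the embodiment's claim that neither channel alone can establish pride.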
Further preferably, the emotion outputter 6 determines the emotional state of the target object from the first, second, and third emotion recognition results and the pre-constructed psychology and behavior mapping relation map as follows. When any one of the first, second, or third emotion recognition confidences is greater than or equal to a set threshold, the emotion corresponding to that confidence is taken as the emotional state of the target object. When all three confidences are below the threshold, emotion labels are computed for the first, second, and third emotions according to a preset weighting rule, yielding a first, second, and third emotion label; the emotional state of the target object is then determined from these labels and the pre-constructed psychology and behavior mapping relation map.
Example two
With reference to fig. 3, an embodiment of the present invention provides a multi-modal emotion recognition method, including:
step S1: the voice receiver 1 receives a voice signal sent by a target object;
step S2: the visual image receiver 2 receives visual image data on a target object;
step S3: the first emotion recognition subsystem 3 acquires a first emotion recognition result according to the voice signal;
step S4: the second emotion recognition subsystem 4 acquires a second emotion recognition result according to the voice signal;
step S5: the third emotion recognition subsystem 5 acquires a third emotion recognition result according to the visual image data;
step S6: and the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Preferably, as shown in fig. 4, the first emotion recognition subsystem 3 obtains a first emotion recognition result according to the voice signal, specifically including,
step S3.1: extracting acoustic prosodic features from a speech signal of the speech receiver 1;
step S3.2: acquiring a first emotion recognition result of the voice signal according to the acoustic rhythm characteristics;
the second emotion recognition subsystem 4 obtains a second emotion recognition result according to the voice signal, which specifically includes,
step S4.1: converting the speech signal of the speech receiver 1 into a sequence of words;
step S4.2: extracting a sentence characteristic value in the character sequence;
step S4.3: acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value;
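The text-side steps S4.1 to S4.3 (word segmentation, word-category analysis, sentence-pattern analysis, then classification) can be sketched as follows. The whitespace tokenizer, sentiment lexicons, and scoring are toy assumptions standing in for a real speech recognizer and the patent's text emotion recognition model:

```python
# Toy lexicons; a real system would use trained models, not word lists.
POSITIVE = {"great", "happy", "love"}
NEGATIVE = {"awful", "sad", "hate"}

def sentence_features(text):
    """S4.2: derive segmentation, word-category, and sentence-pattern
    feature values from a recognized word sequence."""
    tokens = text.lower().rstrip("?!.").split()        # word segmentation
    categories = ["pos" if t in POSITIVE
                  else "neg" if t in NEGATIVE
                  else "neutral" for t in tokens]      # word-category analysis
    pattern = ("question" if text.endswith("?")
               else "exclamation" if text.endswith("!")
               else "statement")                       # sentence-pattern analysis
    return tokens, categories, pattern

def text_emotion(text):
    """S4.3: map the feature values to an (emotion, confidence) pair."""
    tokens, categories, pattern = sentence_features(text)
    score = categories.count("pos") - categories.count("neg")
    emotion = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    confidence = min(1.0, abs(score) / max(len(tokens), 1)
                     + (0.1 if pattern == "exclamation" else 0.0))
    return emotion, confidence
```

The three feature values (tokens, categories, pattern) correspond to the segmentation, word-category, and sentence-pattern feature values named later in the claims.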
the third emotion recognition subsystem 5 obtains a third emotion recognition result according to the visual image data, and specifically includes,
step S5.1: recognizing and tracking face data in the visual image data;
step S5.2: recognizing and tracking whole-body data, including the head, in the visual image data;
step S5.3: extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
step S5.4: extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
step S5.5: acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
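A geometric sketch of how keypoints might be turned into the facial expression and body motion feature values of steps S5.3 to S5.5. The keypoint names and the smile/raised-arms heuristics are illustrative assumptions, not the patent's classifier:

```python
import math

def dist(a, b):
    """Euclidean distance between two (x, y) keypoints."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def facial_expression_value(face_kp):
    """S5.3: mouth width relative to eye distance; larger when smiling."""
    mouth = dist(face_kp["mouth_left"], face_kp["mouth_right"])
    eyes = dist(face_kp["eye_left"], face_kp["eye_right"])
    return mouth / eyes

def body_motion_value(body_kp):
    """S5.4: positive when the wrists sit above the shoulders
    (raised arms); note that image y grows downward."""
    shoulder_y = (body_kp["shoulder_left"][1] + body_kp["shoulder_right"][1]) / 2
    wrist_y = (body_kp["wrist_left"][1] + body_kp["wrist_right"][1]) / 2
    return shoulder_y - wrist_y

def visual_emotion(face_kp, body_kp):
    """S5.5: classify from the two feature values (toy thresholds)."""
    smiling = facial_expression_value(face_kp) > 1.2
    excited = body_motion_value(body_kp) > 0
    if smiling and excited:
        return "joy", 0.9
    if smiling:
        return "contentment", 0.7
    return "neutral", 0.5
```

In practice the keypoints would come from face and pose detectors, and the thresholds would be replaced by a trained emotion classifier as the text describes.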
the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically includes,
step S6.1: determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition method provided by this embodiment of the invention integrates text recognition, speech recognition and visual image recognition technologies and performs human emotion recognition over several channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed according to behavioral psychology relations; it is essentially a map from a person's behavioral expression to the person's real emotion.
Specifically, the acoustic prosodic features include pitch, intensity, timbre, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square energy, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral quotient, spectral flatness, spectral tilt, spectral sharpness, chroma, spectral roll-off point, spectral slope, harmonics, voicing probability, formants, speech onset points and spectral envelope.
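Two of the listed features, zero-crossing rate and spectral centroid, have standard definitions and can be computed directly from raw samples. A pure-Python sketch; production code would use a DSP library and frame-wise analysis:

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose sign changes."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency, via a naive DFT
    over the first half of the spectrum."""
    n = len(samples)
    mags, freqs = [], []
    for k in range(n // 2):
        re = sum(s * math.cos(2 * math.pi * k * i / n)
                 for i, s in enumerate(samples))
        im = -sum(s * math.sin(2 * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        mags.append(math.hypot(re, im))
        freqs.append(k * sample_rate / n)
    total = sum(mags)
    return sum(f * m for f, m in zip(freqs, mags)) / total if total else 0.0
```

For a pure tone the centroid lands on the tone's frequency, and the zero-crossing rate reflects twice the number of cycles per sample.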
Although the present invention has been described to a certain extent, it will be apparent that appropriate changes to the respective conditions may be made without departing from its spirit and scope. It is to be understood that the invention is not limited to the described embodiments, but is to be accorded the full scope of the claims, including equivalents of each element described.
Claims (5)
1. A multi-modal emotion recognition system, comprising: a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem and an emotion outputter;
the voice receiver is used for receiving a voice signal sent by a target object;
the visual image receiver is used for receiving visual image data about the target object;
the first emotion recognition subsystem is used for acquiring a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem is used for acquiring a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem is used for acquiring a third emotion recognition result according to the visual image data;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem specifically comprises an emotion significance segmenter and a first emotion recognizer;
the emotion significance segmenter is used for extracting acoustic prosodic features from the voice signal of the voice receiver;
the first emotion recognizer is used for acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem specifically comprises a voice recognizer, a sentence characteristic value extractor and a second emotion recognizer;
the voice recognizer is used for converting the voice signal of the voice receiver into a character sequence;
the sentence characteristic value extractor is used for extracting the sentence characteristic values in the character sequence;
the second emotion recognizer is used for acquiring a second emotion recognition result of the voice signal according to the statement feature value;
the third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor and a third emotion recognizer;
the face recognition tracker is used for recognizing and tracking face data in the visual image data;
the human body recognition tracker is used for recognizing and tracking whole-body data, including the head, in the visual image data;
the facial expression feature extractor is used for extracting facial key points in the face data and acquiring facial expression feature values according to the facial key points;
the body motion characteristic extractor is used for extracting body motion key points in the human body data and acquiring body motion characteristic values according to the body motion key points;
the third emotion recognizer is used for acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body motion characteristic value;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, wherein the psychology and behavior mapping relation map is a relation library pre-constructed according to behavioral psychology relations and is essentially a map from a person's behavioral expression to the person's real emotion;
the sentence characteristic value extractor extracts the sentence characteristic values in the character sequence, specifically comprising,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
the second emotion recognizer obtains the second emotion recognition result of the voice signal according to the sentence characteristic value, which specifically comprises,
the second emotion recognizer inputs the word segmentation feature values, the word category feature values and the sentence pattern syntactic feature values in the sentence feature values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
the first emotion recognizer obtains the first emotion recognition result of the voice signal according to the acoustic prosodic feature, and specifically comprises,
the first emotion recognizer substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to acquire neural-like voice features, and substitutes the neural-like voice features into a pre-stored emotion model to acquire a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion.
2. The system of claim 1, wherein the acoustic prosodic features include pitch, intensity, timbre, spectrum, cepstrum, perceptual linear prediction cepstral coefficients, root-mean-square energy, zero-crossing rate, spectral flux, spectral centroid, bandwidth, spectral quotient, spectral flatness, spectral tilt, spectral sharpness, chroma, spectral roll-off point, spectral slope, harmonics, voicing probability, formants, speech onset points and spectral envelope.
3. The multi-modal emotion recognition system of claim 1, wherein the third emotion recognizer obtains a third emotion recognition result of the visual image data based on the facial expression feature value and the body motion feature value, and in particular comprises,
the third emotion recognizer substitutes the facial expression characteristic value and the body action characteristic value into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
4. The multi-modal emotion recognition system of any of claims 1 to 3, wherein the emotion outputter determines the emotional state of the target object based on the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, and specifically comprises,
determining the emotion corresponding to that emotion recognition confidence as the emotional state of the target object when any one of the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold;
when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result are all smaller than the set threshold, respectively calculating emotion labels according to a preset weight rule for the first emotion of the first emotion recognition result, the second emotion of the second emotion recognition result and the third emotion of the third emotion recognition result, to obtain a first emotion label, a second emotion label and a third emotion label;
and determining the emotional state of the target object according to the first emotion label, the second emotion label, the third emotion label and the pre-constructed psychology and behavior mapping relation map.
5. A multi-modal emotion recognition method, comprising:
the voice receiver receives a voice signal sent by a target object;
a visual image receiver receiving visual image data about the target object;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal, and specifically comprises,
extracting acoustic prosodic features from the speech signal of the speech receiver;
acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal, and specifically comprises,
converting the speech signal of the speech receiver into a sequence of words;
extracting a sentence characteristic value in the character sequence;
acquiring the second emotion recognition result of the voice signal according to the statement characteristic value;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data, and specifically comprises,
recognizing and tracking face data in the visual image data;
recognizing and tracking whole-body data, including the head, in the visual image data;
extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically comprises,
determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation atlas; the psychology and behavior mapping relation map is a relation library constructed in advance according to behavior and psychology relations, and the essence of the psychology and behavior mapping relation map is a mapping relation map from behavior appearance of a person to real emotion of the person;
the extracting of the sentence characteristic values in the text sequence specifically comprises,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
obtaining the second emotion recognition result of the speech signal according to the sentence characteristic value, specifically including,
inputting the word segmentation characteristic values, the word category characteristic values and the sentence pattern syntactic characteristic values in the sentence characteristic values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
obtaining the first emotion recognition result of the voice signal according to the acoustic prosody feature, specifically comprising,
substituting the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain nerve-like voice features, and substituting the nerve-like voice features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion;
the acoustic rhythm characteristics comprise pitch, intensity, tone quality, sound spectrum, cepstrum, linear perception prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, frequency spectrum flow, frequency spectrum mass center, frequency bandwidth, frequency spectrum quotient, frequency spectrum flatness, frequency spectrum inclination, frequency spectrum sharpness, sound chromaticity, frequency spectrum attenuation point, frequency spectrum inclination, single-frequency overtone, sound probability, sound formant, voice climbing point and frequency spectrum envelope.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610912302.5A CN106503646B (en) | 2016-10-19 | 2016-10-19 | Multi-mode emotion recognition system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106503646A CN106503646A (en) | 2017-03-15 |
CN106503646B true CN106503646B (en) | 2020-07-10 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12080285B2 (en) | 2020-04-22 | 2024-09-03 | Google Llc | Semi-delegated calling by an automated assistant on behalf of human participant |