CN106503646B - Multi-mode emotion recognition system and method - Google Patents

Multi-mode emotion recognition system and method

Info

Publication number
CN106503646B
CN106503646B (application CN201610912302.5A)
Authority
CN
China
Prior art keywords
emotion
emotion recognition
recognition result
voice signal
visual image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610912302.5A
Other languages
Chinese (zh)
Other versions
CN106503646A (en)
Inventor
简仁贤
杨闵淳
林志豪
孙廷伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Emotibot Technologies Ltd
Original Assignee
Emotibot Technologies Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Emotibot Technologies Ltd filed Critical Emotibot Technologies Ltd
Priority to CN201610912302.5A priority Critical patent/CN106503646B/en
Publication of CN106503646A publication Critical patent/CN106503646A/en
Application granted granted Critical
Publication of CN106503646B publication Critical patent/CN106503646B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1807Speech classification or search using natural language modelling using prosody or stress
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • G10L15/1822Parsing for meaning understanding
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides a multi-modal emotion recognition system and method. The system comprises a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem and an emotion outputter. The voice receiver is used for receiving a voice signal sent by a target object; the visual image receiver is used for receiving visual image data about the target object; the first emotion recognition subsystem is used for acquiring a first emotion recognition result according to the voice signal; the second emotion recognition subsystem is used for acquiring a second emotion recognition result according to the voice signal; the third emotion recognition subsystem is used for acquiring a third emotion recognition result according to the visual image data; and the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.

Description

Multi-mode emotion recognition system and method
Technical Field
The invention relates to computer processing technology, and in particular to a multi-modal emotion recognition system and method.
Background
At present, an emotion recognition machine usually recognizes human emotion with only one of text recognition technology, speech recognition technology or visual image recognition technology. Such a single recognition mode draws on a small amount of information and makes it difficult to recognize human emotion in complex situations.
Disclosure of Invention
The invention provides a multi-modal emotion recognition system and method that combine text recognition technology, speech recognition technology and visual image recognition technology and perform human emotion recognition over several channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
In one aspect, the present invention provides a multi-modal emotion recognition system comprising a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem and an emotion outputter; the voice receiver is used for receiving a voice signal sent by a target object; the visual image receiver is used for receiving visual image data about the target object; the first emotion recognition subsystem is used for acquiring a first emotion recognition result according to the voice signal; the second emotion recognition subsystem is used for acquiring a second emotion recognition result according to the voice signal; the third emotion recognition subsystem is used for acquiring a third emotion recognition result according to the visual image data; and the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Further, the first emotion recognition subsystem specifically comprises an emotion saliency divider and a first emotion recognizer; the emotion saliency divider is used for extracting acoustic prosodic features from the voice signal of the voice receiver; the first emotion recognizer is used for acquiring a first emotion recognition result of the voice signal according to the acoustic prosodic features; the second emotion recognition subsystem specifically comprises a voice recognizer, a sentence characteristic value extractor and a second emotion recognizer; the voice recognizer is used for converting the voice signal of the voice receiver into a character sequence; the sentence characteristic value extractor is used for extracting a sentence characteristic value in the character sequence; the second emotion recognizer is used for acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value; the third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor and a third emotion recognizer; the face recognition tracker is used for recognizing and tracking face data in the visual image data; the human body recognition tracker is used for recognizing and tracking the entire human body data including the head in the visual image data; the facial expression feature extractor is used for extracting facial key points in the face data and acquiring facial expression feature values according to the facial key points; the body action feature extractor is used for extracting body motion key points in the human body data and acquiring body motion characteristic values according to the body motion key points; the third emotion recognizer is used for acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value; and the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
The first emotion recognizer is used for substituting the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like speech features, and substituting the neural-like speech features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single-frequency overtone, sound probability, sound formant, voice climbing point, and spectral envelope.
Further, the sentence characteristic value extractor extracts the sentence characteristic values in the character sequence, and specifically comprises the steps of performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values; and the second emotion recognizer is used for inputting the word segmentation characteristic value, the word category characteristic value and the sentence pattern syntactic characteristic value in the sentence characteristic value into a pre-constructed text emotion recognition model so as to obtain a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion.
Further, the third emotion recognizer obtains a third emotion recognition result of the visual image data according to the facial expression feature value and the body motion feature value, and specifically includes: the third emotion recognizer substitutes the facial expression feature value and the body motion feature value into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
Further, the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, and specifically includes: when any one of a first emotion recognition confidence of the first emotion recognition result, a second emotion recognition confidence of the second emotion recognition result and a third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold, the emotion corresponding to that emotion recognition confidence is determined as the emotional state of the target object; when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result are all smaller than the set threshold, emotion labels are respectively calculated for the first emotion of the first emotion recognition result, the second emotion of the second emotion recognition result and the third emotion of the third emotion recognition result according to a preset weight rule to obtain a first emotion label, a second emotion label and a third emotion label; and the emotional state of the target object is determined according to the first emotion label, the second emotion label, the third emotion label and the pre-constructed psychology and behavior mapping relation map.
In another aspect, the present invention provides a multi-modal emotion recognition method, including: the voice receiver receives a voice signal sent by a target object; a visual image receiver receives visual image data about the target object; the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal; the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal; the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data; and the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Further, the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal, which specifically comprises: extracting acoustic prosodic features from the voice signal of the voice receiver; and acquiring a first emotion recognition result of the voice signal according to the acoustic prosodic features. The second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal, which specifically comprises: converting the voice signal of the voice receiver into a character sequence; extracting a sentence characteristic value in the character sequence; and acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value. The third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data, which specifically comprises: recognizing and tracking face data in the visual image data; recognizing and tracking the whole human body data including the head in the visual image data; extracting face key points in the face data and acquiring facial expression characteristic values according to the face key points; extracting body action key points in the human body data and acquiring body action characteristic values according to the body action key points; and acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value. The emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically comprises: determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
Further, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single-frequency overtone, sound probability, sound formant, voice climbing point, and spectral envelope.
The multi-modal emotion recognition system and method provided by the invention integrate text recognition technology, speech recognition technology and visual image recognition technology, and perform human emotion recognition over several channels, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
Drawings
FIG. 1 is a block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 2 is another block diagram of a multi-modal emotion recognition system provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a multi-modal emotion recognition method provided by an embodiment of the present invention;
FIG. 4 is a flowchart of a multi-modal emotion recognition method according to an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following specific examples, which, however, are to be construed as merely illustrative, and not limitative of the remainder of the disclosure in any way whatsoever.
Example one
With reference to fig. 1, the multi-modal emotion recognition system provided in this embodiment comprises a voice receiver 1, a first emotion recognition subsystem 3, a second emotion recognition subsystem 4, a visual image receiver 2, a third emotion recognition subsystem 5 and an emotion outputter 6; the voice receiver 1 is used for receiving a voice signal emitted by a target object; the visual image receiver 2 is used for receiving visual image data on the target object; the first emotion recognition subsystem 3 is used for acquiring a first emotion recognition result according to the voice signal; the second emotion recognition subsystem 4 is used for acquiring a second emotion recognition result according to the voice signal; the third emotion recognition subsystem 5 is used for acquiring a third emotion recognition result according to the visual image data; and the emotion outputter 6 is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
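The wiring of these six components can be sketched in Python as follows; this is a minimal illustration only, and every class, attribute and method name here is hypothetical rather than prescribed by the embodiment.

```python
from dataclasses import dataclass

# Minimal sketch of how the six components described above could be wired together.
# All names are illustrative; the embodiment does not prescribe any particular API.
@dataclass
class MultimodalEmotionRecognizer:
    voice_receiver: object        # component 1: supplies the raw voice signal
    visual_receiver: object       # component 2: supplies visual image data
    prosody_subsystem: object     # component 3: first emotion recognition subsystem
    text_subsystem: object        # component 4: second emotion recognition subsystem
    visual_subsystem: object      # component 5: third emotion recognition subsystem
    emotion_outputter: object     # component 6: fuses the three recognition results

    def recognize(self):
        speech = self.voice_receiver.receive()
        frames = self.visual_receiver.receive()
        r1 = self.prosody_subsystem.recognize(speech)   # (first emotion, confidence)
        r2 = self.text_subsystem.recognize(speech)      # (second emotion, confidence)
        r3 = self.visual_subsystem.recognize(frames)    # (third emotion, confidence)
        return self.emotion_outputter.decide(r1, r2, r3)
```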
Preferably, as shown in fig. 2, the first emotion recognition subsystem 3 specifically includes an emotion saliency divider 301, a first emotion recognizer 302; an emotion saliency divider 301 for extracting acoustic prosody features from the speech signal of the speech receiver 1; a first emotion recognizer 302, configured to obtain a first emotion recognition result of the speech signal according to the acoustic prosody feature; the second emotion recognition subsystem 4 specifically includes a speech recognizer 401, a sentence feature value extractor 402, and a second emotion recognizer 403; a speech recognizer 401 for converting the speech signal of the speech receiver 1 into a sequence of words; a sentence characteristic value extractor 402 for extracting a sentence characteristic value in the text sequence; a second emotion recognizer 403, configured to obtain a second emotion recognition result of the speech signal according to the sentence feature value; the third emotion recognition subsystem 5 specifically includes a face recognition tracker 501, a human body recognition tracker 503, a facial expression feature extractor 502, a body action feature extractor 504, and a third emotion recognizer 505; a face recognition tracker 501 for recognizing and tracking face data in the visual image data; a human body recognition tracker 503 for recognizing and tracking the entire human body data including the head in the visual image data; a facial expression feature extractor 502, configured to extract facial key points in the face data, and obtain facial expression feature values according to the facial key points; a body motion feature extractor 504, configured to extract body motion key points in the human body data, and obtain a body motion feature value according to the body motion key points; a third emotion recognizer 505, configured to obtain a third emotion recognition result of the visual image data according to the facial expression feature value and the body motion feature value; and the emotion outputter 6 is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition system provided by this embodiment of the invention integrates text recognition technology, speech recognition technology and visual image recognition technology and performs human emotion recognition over several channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed according to behavioral-psychology relationships; in essence, it is a mapping from a person's behavioral expression to that person's real emotion.
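As a toy illustration of such a relation library, the map can be thought of as a lookup from a combination of observed behavioural cues to an inferred emotion; the cue names and entries below (including the pride example discussed later in this embodiment) are purely illustrative assumptions, not the map the embodiment actually uses.

```python
# Toy stand-in for the psychology and behavior mapping relation map: a lookup
# from a set of observed behavioural cues to an inferred real emotion.
# Entries are illustrative; a real map would come from behavioural-psychology data.
BEHAVIOUR_TO_EMOTION = {
    frozenset({"chest_out", "chin_raised", "smiling"}): "pride",
    frozenset({"smiling"}): "happiness",
    frozenset({"frowning", "raised_voice"}): "anger",
}

def lookup_emotion(cues):
    """Return the mapped emotion for a set of behavioural cues, or None if unmapped."""
    return BEHAVIOUR_TO_EMOTION.get(frozenset(cues))
```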
Further preferably, the first emotion recognizer 302 obtains the first emotion recognition result of the speech signal according to the acoustic prosodic features, and specifically includes that the first emotion recognizer 302 substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like speech features, and substitutes the neural-like speech features into a pre-stored emotion model to obtain the first emotion of the speech signal and the first emotion recognition confidence corresponding to the first emotion.
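A compact sketch of this two-stage idea is given below; the random weights, dimensions and label set are placeholders for the pre-constructed brain-like model and the pre-stored emotion model, neither of which the patent specifies.

```python
import numpy as np

# Two-stage sketch of the first emotion recognizer: a small non-linear projection
# stands in for the brain-like model (prosodic features -> neural-like speech
# features), and a softmax layer stands in for the pre-stored emotion model.
# Dimensions, weights and labels are placeholders, not trained models.
EMOTIONS = ["happy", "angry", "sad", "neutral"]   # illustrative label set
N_FEATURES, N_HIDDEN = 23, 64                     # assumed sizes

rng = np.random.default_rng(0)
W_brain = rng.normal(size=(N_FEATURES, N_HIDDEN))
W_emotion = rng.normal(size=(N_HIDDEN, len(EMOTIONS)))

def recognize_from_prosody(prosodic_features):
    """Return (first_emotion, first_confidence) for an N_FEATURES-dim prosodic vector."""
    neural_like = np.tanh(prosodic_features @ W_brain)   # stage 1: brain-like model
    logits = neural_like @ W_emotion                     # stage 2: emotion model
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return EMOTIONS[best], float(probs[best])
```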
Specifically, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single-frequency overtone, sound probability, sound formant, speech climbing point, spectral envelope.
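Several of these features can be computed with an off-the-shelf audio library; the sketch below uses librosa (an assumption, since the patent names no toolkit) and covers only a subset of the list.

```python
import numpy as np
import librosa   # assumed toolkit; the embodiment does not name one

# Sketch of extracting a subset of the listed prosodic/spectral features from a
# speech file: pitch, RMS intensity, zero-crossing rate, spectral centroid,
# bandwidth, flatness, roll-off (attenuation point), chroma and cepstral coefficients.
def extract_prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)   # pitch contour
    return {
        "pitch_mean": float(np.nanmean(f0)),
        "rms": float(librosa.feature.rms(y=y).mean()),
        "zero_crossing_rate": float(librosa.feature.zero_crossing_rate(y).mean()),
        "spectral_centroid": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        "bandwidth": float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()),
        "flatness": float(librosa.feature.spectral_flatness(y=y).mean()),
        "rolloff": float(librosa.feature.spectral_rolloff(y=y, sr=sr).mean()),
        "chroma": librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1),
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1),
    }
```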
Further preferably, the sentence characteristic value extractor 402 extracts the sentence characteristic values in the text sequence, specifically including performing word segmentation processing on the text sequence to obtain word segmentation characteristic values, performing word category analysis on the text sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the text sequence to obtain sentence pattern syntactic characteristic values; the second emotion recognizer 403 obtains a second emotion recognition result of the speech signal according to the sentence characteristic values, and specifically includes the second emotion recognizer 403 inputting the word segmentation characteristic value, the word category characteristic value and the sentence pattern syntactic characteristic value among the sentence characteristic values into a pre-constructed text emotion recognition model to obtain a second emotion of the speech signal and a second emotion recognition confidence corresponding to the second emotion.
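For Chinese text, the three feature groups can be approximated with a standard segmentation toolkit; the sketch below uses jieba (an assumption) and a crude punctuation-based stand-in for sentence-pattern syntactic analysis.

```python
import jieba                 # Chinese word segmentation toolkit (assumed; not named by the patent)
import jieba.posseg as pseg  # part-of-speech tagging

# Sketch of the three sentence-level feature groups fed to the second emotion
# recognizer: word-segmentation features, word-category (POS) features, and a
# crude sentence-pattern feature. A real system would feed these into a trained
# text emotion recognition model.
def extract_sentence_features(text):
    words = jieba.lcut(text)                                # word segmentation characteristic values
    pos = [(tok.word, tok.flag) for tok in pseg.cut(text)]  # word category characteristic values
    pattern = {                                             # stand-in for sentence-pattern syntax
        "is_question": text.rstrip().endswith(("?", "？")),
        "is_exclamation": text.rstrip().endswith(("!", "！")),
        "num_words": len(words),
    }
    return {"words": words, "pos": pos, "pattern": pattern}
```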
Further preferably, the third emotion recognizer 505 obtains a third emotion recognition result of the visual image data according to the facial expression feature value and the body motion feature value, and specifically includes: the third emotion recognizer 505 substitutes the facial expression feature value and the body motion feature value into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
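The sketch below illustrates this classification step with a plain logistic-regression classifier trained on random placeholder data; in the embodiment the classifier would be pre-constructed from labelled facial-expression and body-action features, and the feature dimensions and label set used here are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for the pre-constructed emotion classifier

EMOTIONS = ["happy", "angry", "sad", "pride", "neutral"]   # illustrative label set

# Fit a placeholder classifier on random data so the sketch runs end to end;
# the real classifier would be trained on labelled facial and body features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))                 # 16 facial + 16 body features (assumed)
y_train = rng.integers(0, len(EMOTIONS), size=200)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def recognize_visual(face_features, body_features):
    """Return (third_emotion, third_confidence) from the concatenated feature vectors."""
    x = np.concatenate([face_features, body_features]).reshape(1, -1)
    probs = clf.predict_proba(x)[0]
    best = int(np.argmax(probs))
    return EMOTIONS[int(clf.classes_[best])], float(probs[best])
```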
In this embodiment, facial expressions are combined with body movements to perform emotion recognition on the visual image data, which can improve the emotion recognition rate. For example, when a person throws out the chest, raises the chin and wears a smile, the corresponding emotion is pride; but from a raised chest and chin alone, or from a smile alone, pride cannot be inferred. In addition, this embodiment draws on the psychological research results of Paul Ekman and adopts a deep learning model of facial expressions and body movements to distinguish human emotions.
Further preferably, the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and the pre-constructed psychology and behavior mapping relation map, and specifically includes: when any one of the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold, the emotion corresponding to that emotion recognition confidence is determined as the emotional state of the target object; when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result are all smaller than the set threshold, emotion labels are respectively calculated for the first emotion of the first emotion recognition result, the second emotion of the second emotion recognition result and the third emotion of the third emotion recognition result according to a preset weight rule to obtain a first emotion label, a second emotion label and a third emotion label; and the emotional state of the target object is determined according to the first emotion label, the second emotion label, the third emotion label and the pre-constructed psychology and behavior mapping relation map.
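The decision rule just described can be sketched as follows; the threshold value, the channel weights, the handling of ties (taking the most confident channel when several exceed the threshold) and the dictionary-based map lookup are all assumptions, since the embodiment only specifies "a set threshold" and "a preset weight rule".

```python
# Sketch of the fusion rule of the emotion outputter. THRESHOLD, WEIGHTS and the
# dict-based map lookup are illustrative assumptions.
THRESHOLD = 0.8
WEIGHTS = {"prosody": 0.3, "text": 0.3, "visual": 0.4}

def fuse(results, behaviour_map):
    """results: {"prosody": (emotion, conf), "text": (emotion, conf), "visual": (emotion, conf)}."""
    # Case 1: some channel is confident enough -> take that channel's emotion
    # (here: the most confident one, if several pass the threshold).
    best = max(results, key=lambda ch: results[ch][1])
    if results[best][1] >= THRESHOLD:
        return results[best][0]
    # Case 2: all confidences below the threshold -> weight each channel's emotion,
    # then consult the psychology and behavior mapping relation map.
    scores = {}
    for channel, (emotion, conf) in results.items():
        scores[emotion] = scores.get(emotion, 0.0) + WEIGHTS[channel] * conf
    ranked = tuple(sorted(scores, key=scores.get, reverse=True))
    return behaviour_map.get(ranked, ranked[0])   # fall back to the top-weighted label
```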
Example two
With reference to fig. 3, an embodiment of the present invention provides a multi-modal emotion recognition method, including:
step S1: the voice receiver 1 receives a voice signal sent by a target object;
step S2: the visual image receiver 2 receives visual image data on a target object;
step S3: the first emotion recognition subsystem 3 acquires a first emotion recognition result according to the voice signal;
step S4: the second emotion recognition subsystem 4 acquires a second emotion recognition result according to the voice signal;
step S5: the third emotion recognition subsystem 5 acquires a third emotion recognition result according to the visual image data;
step S6: and the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result.
Preferably, as shown in fig. 4, the first emotion recognition subsystem 3 obtains a first emotion recognition result according to the voice signal, specifically including,
step S3.1: extracting acoustic prosodic features from a speech signal of the speech receiver 1;
step S3.2: acquiring a first emotion recognition result of the voice signal according to the acoustic prosodic features;
the second emotion recognition subsystem 4 obtains a second emotion recognition result according to the voice signal, which specifically includes,
step S4.1: converting the speech signal of the speech receiver 1 into a sequence of words;
step S4.2: extracting a sentence characteristic value in the character sequence;
step S4.3: acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value;
the third emotion recognition subsystem 5 obtains a third emotion recognition result according to the visual image data, and specifically includes,
step S5.1: recognizing and tracking face data in the visual image data;
step S5.2: recognizing and tracking the whole human body data including the head in the visual image data;
step S5.3: extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
step S5.4: extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
step S5.5: acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
the emotion outputter 6 determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically includes,
step S6.1: and determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map.
The multi-modal emotion recognition method provided by this embodiment of the invention integrates text recognition technology, speech recognition technology and visual image recognition technology and performs human emotion recognition over several channels simultaneously, so that an emotion recognition machine can accurately recognize the emotion of a target object during human-computer interaction.
It should be noted that the psychology and behavior mapping relation map mentioned in this embodiment is a relation library pre-constructed according to behavioral-psychology relationships; in essence, it is a mapping from a person's behavioral expression to that person's real emotion.
Specifically, the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstrum coefficient, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single-frequency overtone, sound probability, sound formant, speech climbing point, spectral envelope.
Although the present invention has been described to a certain extent, it is apparent that appropriate changes in the respective conditions may be made without departing from the spirit and scope of the present invention. It is to be understood that the invention is not limited to the described embodiments, but is to be accorded the scope consistent with the claims, including equivalents of each element described.

Claims (5)

1. A multi-modal emotion recognition system, comprising: a voice receiver, a first emotion recognition subsystem, a second emotion recognition subsystem, a visual image receiver, a third emotion recognition subsystem and an emotion outputter;
the voice receiver is used for receiving a voice signal sent by a target object;
the visual image receiver is used for receiving visual image data about the target object;
the first emotion recognition subsystem is used for acquiring a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem is used for acquiring a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem is used for acquiring a third emotion recognition result according to the visual image data;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem specifically comprises an emotion saliency divider and a first emotion recognizer;
the emotion saliency divider is used for extracting acoustic prosodic features from the voice signal of the voice receiver;
the first emotion recognizer is used for acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem specifically comprises a voice recognizer, a sentence characteristic value extractor and a second emotion recognizer;
the voice recognizer is used for converting the voice signal of the voice receiver into a character sequence;
the sentence characteristic value extractor is used for extracting the sentence characteristic values in the character sequence;
the second emotion recognizer is used for acquiring a second emotion recognition result of the voice signal according to the sentence characteristic value;
the third emotion recognition subsystem specifically comprises a face recognition tracker, a human body recognition tracker, a facial expression feature extractor, a body action feature extractor and a third emotion recognizer;
the face recognition tracker is used for recognizing and tracking face data in the visual image data;
the human body identification tracker is used for identifying and tracking the whole human body data including the head in the visual image data;
the facial expression feature extractor is used for extracting facial key points in the face data and acquiring facial expression feature values according to the facial key points;
the body motion characteristic extractor is used for extracting body motion key points in the human body data and acquiring body motion characteristic values according to the body motion key points;
the third emotion recognizer is used for acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body motion characteristic value;
the emotion outputter is used for determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, wherein the psychology and behavior mapping relation map is a relation library pre-constructed according to behavioral-psychology relations and is essentially a mapping from the behavioral expression of a person to the real emotion of the person;
the sentence characteristic value extractor extracts the sentence characteristic values in the character sequence, specifically comprising,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
the second emotion recognizer obtains the second emotion recognition result of the voice signal according to the sentence characteristic value, which specifically comprises,
the second emotion recognizer inputs the word segmentation feature values, the word category feature values and the sentence pattern syntactic feature values in the sentence feature values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
the first emotion recognizer obtains the first emotion recognition result of the voice signal according to the acoustic prosodic feature, and specifically comprises,
and the first emotion recognizer substitutes the acoustic prosodic features into a pre-constructed brain-like machine learning model to acquire neural-like speech features, and substitutes the neural-like speech features into a pre-stored emotion model to acquire a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion.
2. The system of claim 1, wherein the acoustic prosodic features include pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single frequency overtone, sound probability, sound formant, speech climb point, spectral envelope.
3. The multi-modal emotion recognition system of claim 1, wherein the third emotion recognizer obtains a third emotion recognition result of the visual image data based on the facial expression feature value and the body motion feature value, and in particular comprises,
and the third emotion recognizer substitutes the facial expression characteristic value and the body action characteristic value into a pre-constructed emotion classifier to obtain a third emotion of the visual image data and a third emotion recognition confidence corresponding to the third emotion.
4. The multi-modal emotion recognition system of any one of claims 1 to 3, wherein the emotion outputter determines the emotional state of the target object based on the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map, and in particular comprises,
judging the emotion corresponding to that emotion recognition confidence as the emotional state of the target object when any one of the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result is greater than or equal to a set threshold;
when the first emotion recognition confidence of the first emotion recognition result, the second emotion recognition confidence of the second emotion recognition result and the third emotion recognition confidence of the third emotion recognition result are all smaller than a set threshold value, respectively calculating emotion labels for the first emotion of the first emotion recognition result and the second emotion of the second emotion recognition result and the third emotion of the third emotion recognition result according to a preset weight rule to obtain a first emotion label, a second emotion label and a third emotion label;
and determining the emotional state of the target object according to the first emotion label, the second emotion label, the third emotion label and the pre-constructed psychology and behavior mapping relation map.
5. A multi-modal emotion recognition method, comprising:
the voice receiver receives a voice signal sent by a target object;
a visual image receiver receiving visual image data about the target object;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result;
the first emotion recognition subsystem acquires a first emotion recognition result according to the voice signal, and specifically comprises,
extracting acoustic prosodic features from the speech signal of the speech receiver;
acquiring the first emotion recognition result of the voice signal according to the acoustic prosodic feature;
the second emotion recognition subsystem acquires a second emotion recognition result according to the voice signal, and specifically comprises,
converting the speech signal of the speech receiver into a sequence of words;
extracting a sentence characteristic value in the character sequence;
acquiring the second emotion recognition result of the voice signal according to the sentence characteristic value;
the third emotion recognition subsystem acquires a third emotion recognition result according to the visual image data, and specifically comprises,
recognizing and tracking face data in the visual image data;
identifying and tracking whole human body data including a head in the visual image data;
extracting face key points in the face data, and acquiring facial expression characteristic values according to the face key points;
extracting body action key points in the human body data, and acquiring body action characteristic values according to the body action key points;
acquiring a third emotion recognition result of the visual image data according to the facial expression characteristic value and the body action characteristic value;
the emotion outputter determines the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result and the third emotion recognition result, which specifically comprises,
determining the emotional state of the target object according to the first emotion recognition result, the second emotion recognition result, the third emotion recognition result and a pre-constructed psychology and behavior mapping relation map; the psychology and behavior mapping relation map is a relation library constructed in advance according to behavioral-psychology relations, and in essence it is a mapping from the behavioral expression of a person to the real emotion of the person;
the extracting of the sentence characteristic values in the text sequence specifically comprises,
performing word segmentation processing on the character sequence to obtain word segmentation characteristic values, performing word category analysis on the character sequence to obtain word category characteristic values, and performing sentence pattern syntactic analysis on the character sequence to obtain sentence pattern syntactic characteristic values;
obtaining the second emotion recognition result of the speech signal according to the sentence characteristic value, specifically including,
inputting the word segmentation characteristic values, the word category characteristic values and the sentence pattern syntactic characteristic values in the sentence characteristic values into a pre-constructed text emotion recognition model to acquire a second emotion of the voice signal and a second emotion recognition confidence corresponding to the second emotion;
obtaining the first emotion recognition result of the voice signal according to the acoustic prosody feature, specifically comprising,
substituting the acoustic prosodic features into a pre-constructed brain-like machine learning model to obtain neural-like speech features, and substituting the neural-like speech features into a pre-stored emotion model to obtain a first emotion of the voice signal and a first emotion recognition confidence corresponding to the first emotion;
the acoustic prosodic features comprise pitch, intensity, timbre, sound spectrum, cepstrum, linear perceptual prediction cepstral coefficients, root-mean-square intensity, zero crossing rate, spectral flow, spectral centroid, frequency bandwidth, spectral quotient, spectral flatness, spectral slope, spectral sharpness, sound chromaticity, spectral attenuation point, spectral slope, single-frequency overtone, sound probability, sound formant, speech climb point and spectral envelope.
CN201610912302.5A 2016-10-19 2016-10-19 Multi-mode emotion recognition system and method Active CN106503646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610912302.5A CN106503646B (en) 2016-10-19 2016-10-19 Multi-mode emotion recognition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610912302.5A CN106503646B (en) 2016-10-19 2016-10-19 Multi-mode emotion recognition system and method

Publications (2)

Publication Number Publication Date
CN106503646A CN106503646A (en) 2017-03-15
CN106503646B true CN106503646B (en) 2020-07-10

Family

ID=58294258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610912302.5A Active CN106503646B (en) 2016-10-19 2016-10-19 Multi-mode emotion recognition system and method

Country Status (1)

Country Link
CN (1) CN106503646B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12080285B2 (en) 2020-04-22 2024-09-03 Google Llc Semi-delegated calling by an automated assistant on behalf of human participant

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107194151B (en) * 2017-04-20 2020-04-03 华为技术有限公司 Method for determining emotion threshold value and artificial intelligence equipment
CN107092895A (en) * 2017-05-09 2017-08-25 重庆邮电大学 A kind of multi-modal emotion identification method based on depth belief network
CN108664932B (en) * 2017-05-12 2021-07-09 华中师范大学 Learning emotional state identification method based on multi-source information fusion
CN107180236B (en) * 2017-06-02 2020-02-11 北京工业大学 Multi-modal emotion recognition method based on brain-like model
CN109254669B (en) * 2017-07-12 2022-05-10 腾讯科技(深圳)有限公司 Expression picture input method and device, electronic equipment and system
CN107943299B (en) * 2017-12-07 2022-05-06 上海智臻智能网络科技股份有限公司 Emotion presenting method and device, computer equipment and computer readable storage medium
US10783329B2 (en) 2017-12-07 2020-09-22 Shanghai Xiaoi Robot Technology Co., Ltd. Method, device and computer readable storage medium for presenting emotion
CN109903392B (en) 2017-12-11 2021-12-31 北京京东尚科信息技术有限公司 Augmented reality method and apparatus
CN108091323B (en) * 2017-12-19 2020-10-13 想象科技(北京)有限公司 Method and apparatus for emotion recognition from speech
CN108108849A (en) * 2017-12-31 2018-06-01 厦门大学 A kind of microblog emotional Forecasting Methodology based on Weakly supervised multi-modal deep learning
CN110085211B (en) * 2018-01-26 2021-06-29 上海智臻智能网络科技股份有限公司 Voice recognition interaction method and device, computer equipment and storage medium
CN108899050B (en) * 2018-06-14 2020-10-02 南京云思创智信息科技有限公司 Voice signal analysis subsystem based on multi-modal emotion recognition system
CN109241912B (en) * 2018-09-08 2020-08-07 河南大学 Target identification method based on brain-like cross-media intelligence and oriented to unmanned autonomous system
US20200099634A1 (en) * 2018-09-20 2020-03-26 XRSpace CO., LTD. Interactive Responding Method and Computer System Using the Same
CN109829363A (en) * 2018-12-18 2019-05-31 深圳壹账通智能科技有限公司 Expression recognition method, device, computer equipment and storage medium
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110287912A (en) * 2019-06-28 2019-09-27 广东工业大学 Method, apparatus and medium are determined based on the target object affective state of deep learning
CN110688499A (en) * 2019-08-13 2020-01-14 深圳壹账通智能科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN110910903B (en) * 2019-12-04 2023-03-21 深圳前海微众银行股份有限公司 Speech emotion recognition method, device, equipment and computer readable storage medium
US11630999B2 (en) * 2019-12-19 2023-04-18 Dish Network Technologies India Private Limited Method and system for analyzing customer calls by implementing a machine learning model to identify emotions
CN113128284A (en) * 2019-12-31 2021-07-16 上海汽车集团股份有限公司 Multi-mode emotion recognition method and device

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007092795A2 (en) * 2006-02-02 2007-08-16 Neuric Technologies, Llc Method for movie animation
KR101840644B1 (en) * 2011-05-31 2018-03-22 한국전자통신연구원 System of body gard emotion cognitive-based, emotion cognitive device, image and sensor controlling appararus, self protection management appararus and method for controlling the same
CN102298694A (en) * 2011-06-21 2011-12-28 广东爱科数字科技有限公司 Man-machine interaction identification system applied to remote information service
CN102881284B (en) * 2012-09-03 2014-07-09 江苏大学 Unspecific human voice and emotion recognition method and system
US9031293B2 (en) * 2012-10-19 2015-05-12 Sony Computer Entertainment Inc. Multi-modal sensor based emotion recognition and emotional interface
CN103123619B (en) * 2012-12-04 2015-10-28 江苏大学 Based on the multi-modal Cooperative Analysis method of the contextual visual speech of emotion
CN103456314B (en) * 2013-09-03 2016-02-17 广州创维平面显示科技有限公司 A kind of emotion identification method and device
CN105334743B (en) * 2015-11-18 2018-10-26 深圳创维-Rgb电子有限公司 A kind of intelligent home furnishing control method and its system based on emotion recognition
CN105739688A (en) * 2016-01-21 2016-07-06 北京光年无限科技有限公司 Man-machine interaction method and device based on emotion system, and man-machine interaction system
CN105975594A (en) * 2016-05-09 2016-09-28 清华大学 Sentiment classification method and device based on combined feature vector and SVM[perf] (Support Vector Machine)
CN105976809B (en) * 2016-05-25 2019-12-17 中国地质大学(武汉) Identification method and system based on speech and facial expression bimodal emotion fusion
CN105869657A (en) * 2016-06-03 2016-08-17 竹间智能科技(上海)有限公司 System and method for identifying voice emotion

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12080285B2 (en) 2020-04-22 2024-09-03 Google Llc Semi-delegated calling by an automated assistant on behalf of human participant

Also Published As

Publication number Publication date
CN106503646A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
CN106503646B (en) Multi-mode emotion recognition system and method
US10433052B2 (en) System and method for identifying speech prosody
Jing et al. Prominence features: Effective emotional features for speech emotion recognition
US7369991B2 (en) Speech recognition system, speech recognition method, speech synthesis system, speech synthesis method, and program product having increased accuracy
CN116547746A (en) Dialog management for multiple users
EP3198589B1 (en) Method and apparatus to synthesize voice based on facial structures
Mariooryad et al. Compensating for speaker or lexical variabilities in speech for emotion recognition
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN107972028B (en) Man-machine interaction method and device and electronic equipment
KR20210155401A (en) Speech synthesis apparatus for evaluating the quality of synthesized speech using artificial intelligence and method of operation thereof
Chandrasekar et al. Automatic speech emotion recognition: A survey
Sethu et al. Speech based emotion recognition
KR102607373B1 (en) Apparatus and method for recognizing emotion in speech
JP2018159788A (en) Information processing device, method and program
CN112466287B (en) Voice segmentation method, device and computer readable storage medium
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
Zbancioc et al. A study about the automatic recognition of the anxiety emotional state using Emo-DB
CN117352000A (en) Speech classification method, device, electronic equipment and computer readable medium
Huang et al. A review of automated intelligibility assessment for dysarthric speakers
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Eyben et al. Audiovisual vocal outburst classification in noisy acoustic conditions
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
Mishra et al. Real time emotion detection from speech using Raspberry Pi 3
Bojanić et al. Application of dimensional emotion model in automatic emotional speech recognition
Rabiei et al. A methodology for recognition of emotions based on speech analysis, for applications to human-robot interaction. An exploratory study

Legal Events

C06, PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant