CN116682168A - Multi-modal expression recognition method, medium and system - Google Patents

Multi-modal expression recognition method, medium and system

Info

Publication number
CN116682168A
Authority
CN
China
Prior art keywords
expression
sequence
features
voice
gesture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310973040.3A
Other languages
Chinese (zh)
Other versions
CN116682168B (en)
Inventor
洪惠群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yango University
Original Assignee
Yango University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yango University filed Critical Yango University
Priority to CN202310973040.3A priority Critical patent/CN116682168B/en
Publication of CN116682168A publication Critical patent/CN116682168A/en
Application granted granted Critical
Publication of CN116682168B publication Critical patent/CN116682168B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Social Psychology (AREA)
  • Ophthalmology & Optometry (AREA)
  • Psychiatry (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal expression recognition method, medium and system, belonging to the technical field of expression recognition. The method comprises the steps of obtaining multi-modal features of a recognition object from video, including facial motion features, eyeball motion features, voice features and gesture motion features; acquiring a plurality of expressions according to the facial motion features to form a first expression sequence; correcting the first expression sequence using the eyeball motion features to obtain a second expression sequence; scoring the second expression sequence using the voice features and gesture motion features; recognizing the expressions whose scores are lower than a scoring threshold in the second expression sequence using a pre-trained multi-modal expression recognition model, and updating the second expression sequence with the recognized expressions to obtain a third expression sequence; dividing the third expression sequence into single expressions and complex expressions to form an expression segment sequence containing at least one expression segment; and outputting the expression segment sequence.

Description

Multi-modal expression recognition method, medium and system
Technical Field
The invention belongs to the technical field of expression recognition, and particularly relates to a multi-modal expression recognition method, medium and system.
Background
With the rapid development of artificial intelligence, machine learning and other technologies, expression recognition has become an important research direction in the fields of computer vision and pattern recognition. Expression is an important mode of human communication; by recognizing and understanding a person's expressions, their emotions and intentions can be better understood and interpreted, which has wide application value in fields such as social robots, video conferencing, telemedicine and intelligent monitoring. Expression recognition technology generally recognizes and understands a person's emotional state by analysing multi-modal information such as facial movements, eye movements, voice and gestures. Conventional expression recognition methods are mainly based on single-modality information, for example performing expression recognition by analysing facial movements or voice information alone. However, because this approach ignores the multi-modal nature of human emotional expression, it often fails to accurately identify and understand a person's true emotional state. To solve this problem, researchers have in recent years begun to attempt expression recognition using multi-modal information, so as to improve the accuracy and robustness of expression recognition.
Most existing expression recognition methods adopt static feature extraction, that is, features are extracted at a single point in time. This ignores the dynamic process of emotion and cannot accurately capture emotional change; in other words, the emotion over a period of time (referred to here as an expression segment) is not recognized or analysed. For example, in the television series Romance of the Three Kingdoms, Sun Quan's expression throughout the banquet scene welcoming Liu Bei constitutes a single expression segment; if that emotion were recognized frame by frame, the accurate emotion could not be identified. Moreover, this approach cannot handle complex emotional states, such as mixed emotions and masked emotions.
Disclosure of Invention
In view of the above, the present invention provides a multi-modal expression recognition method, medium and system, which can solve the technical problems in the prior art that the dynamic process of emotion is ignored, emotional changes cannot be accurately captured, and complex emotional states cannot be processed.
The invention is realized in the following way:
the first aspect of the present invention provides a multi-modal expression recognition method, including the steps of:
s10, acquiring multi-modal characteristics of an identification object from video, wherein the multi-modal characteristics comprise facial action characteristics, eyeball action characteristics, voice characteristics and gesture action characteristics;
s20, acquiring a plurality of expressions according to facial action characteristics to form a first expression sequence;
s30, correcting the first expression sequence by utilizing eyeball action characteristics to obtain a second expression sequence;
s40, scoring the second expression sequence by utilizing the voice characteristics and the gesture action characteristics;
s50, recognizing the expression with the score lower than a score threshold in the second expression sequence by using a pre-trained multi-mode expression recognition model, and updating the second expression sequence by using the recognized expression to obtain a third expression sequence;
s60, carrying out single expression and complex expression division on the third expression sequence to form an expression segment sequence containing at least one expression segment;
S70, outputting the expression segment sequence.
Here, a single expression means that an individual shows only one expression, such as happiness, sadness or surprise, over a period of time. A complex expression means that an individual may exhibit two or more expressions simultaneously during the same period, such as happy and surprised, or angry and sad. In actual emotional expression, complex expressions are common, because human emotion is complex and often cannot be fully conveyed by a single expression.
Based on the technical scheme, the multi-modal expression recognition method can be further improved as follows:
the facial action features, eyeball action features and gesture action features are obtained in the following modes:
performing face recognition, eyeball recognition and gesture recognition on each frame of the video to obtain a facial feature point set describing facial features, an eyeball feature point set describing eyeball features and a gesture feature point set describing gesture features;
and establishing a facial motion feature matrix, an eyeball motion feature matrix and a gesture motion feature matrix according to the time axis by using the facial feature point set, the eyeball feature point set and the gesture feature point set corresponding to each frame as facial motion features, eyeball motion features and gesture motion features.
The voice characteristic is obtained in the following manner:
extracting an audio signal from the video;
preprocessing the extracted audio signal, including noise reduction, gain and other operations, so as to improve the quality of the audio signal;
performing feature extraction on the preprocessed audio signal through a voice feature extractor to obtain basic voice features of the audio signal;
identifying the mouth shape in the video to obtain mouth shape characteristics;
training a neural network, and obtaining voice characteristics by performing mouth shape voice synchronous processing on mouth shape characteristics and basic voice characteristics.
Further, the step of obtaining a plurality of expressions according to facial motion features to form a first expression sequence specifically includes: and recognizing the expression of each frame according to the facial feature point set corresponding to each frame, wherein the expression of each frame comprises expression expressive degrees, and forming a first expression sequence according to a time axis by the obtained expression of each frame.
The step of correcting the first expression sequence by utilizing eyeball action characteristics to obtain a second expression sequence specifically comprises the following steps:
step 1: quantifying eyeball motion characteristics;
step 2: establishing a corresponding relation between eyeball action characteristics and expressions;
Step 3: and correcting the first expression sequence to obtain a second expression sequence.
The step of scoring the second expression sequence by using the voice feature and the gesture motion feature specifically includes:
extracting corresponding voice characteristics and gesture action characteristics of each expression in the second expression sequence, and directly setting the score of the corresponding expression as the highest score if the voice characteristics or the gesture action characteristics do not exist; if the voice feature and the gesture motion feature exist at the same time, the method comprises the following steps:
scoring the voice features by using a voice emotion scoring model to obtain voice emotion scores;
scoring the gesture action features by using a gesture emotion scoring model to obtain gesture emotion scores;
and calculating expression scores by using a weighted average method.
The step of identifying the expression with the score lower than the score threshold in the second expression sequence by utilizing a pre-trained multi-mode expression identification model, and updating the second expression sequence by utilizing the identified expression to obtain a third expression sequence specifically comprises the following steps:
step 1, defining a scoring threshold value;
step 2, traversing each expression in the second expression sequence to obtain expressions with all scores smaller than a score threshold;
Step 3, recognizing the expression obtained in step 2 by using a pre-trained multi-modal expression recognition model to obtain a new expression, setting the score of the new expression to 1, and updating the second expression sequence to obtain a third expression sequence.
The step of dividing the third expression sequence into single expression and complex expression to form an expression segment sequence comprising at least one expression segment specifically comprises the following steps:
firstly, preprocessing a third expression sequence;
then, carrying out single expression and complex expression division on the pretreated expression sequence;
and finally, combining the divided single expression and the complex expression into an expression segment sequence.
A second aspect of the present invention provides a computer readable storage medium having stored therein program instructions which, when executed, are adapted to carry out a multimodal expression recognition method as described above.
A third aspect of the present invention provides a multimodal expression recognition system, comprising the computer readable storage medium described above.
Compared with the prior art, the multi-modal expression recognition method, medium and system provided by the invention have the beneficial effects that: according to the multi-modal expression recognition method, the facial action features, the eyeball action features, the voice features and the gesture action features are fused, so that the expression is accurately recognized, and the recognition accuracy and the recognition efficiency are high.
First, in step S10, the present invention obtains a facial feature point set describing facial features, an eyeball feature point set describing eyeball features, and a gesture feature point set describing gesture features by performing facial recognition, eyeball recognition, and gesture recognition on each frame in a video. The method can comprehensively and thoroughly capture the multi-modal characteristics of the recognition object, and provides rich input information for subsequent expression recognition. By constructing the facial motion feature matrix, the eyeball motion feature matrix and the gesture motion feature matrix, the dynamic features of the recognition object can be clearly and intuitively presented.
Meanwhile, the invention also innovates the acquisition mode of the voice characteristics. In view of the problem that the distance may cause misalignment of the voice image, the method adopts a mouth shape and voice matching mode for alignment. Specifically, an audio signal is extracted from a video, the audio signal is preprocessed to improve its quality, and then feature extraction is performed by a speech feature extractor. And simultaneously, recognizing the mouth shape in the video to obtain mouth shape characteristics. And (3) performing mouth shape voice synchronous processing on the mouth shape characteristics and the basic voice characteristics by training a neural network, so as to obtain the voice characteristics. The method can avoid the problem of misalignment of the voice image, improve the accuracy of voice features and be beneficial to improving the accuracy of expression recognition.
Then, in step S20, the present invention acquires a plurality of expressions according to facial motion characteristics to form a first expression sequence. The process can accurately capture the facial expression change of the recognition object, and provides an important basis for subsequent expression recognition.
In the step S30, the first expression sequence is corrected by utilizing the eyeball action characteristics to obtain a second expression sequence. The action of the eyeball can reflect the internal emotion of the recognition object, so that the accuracy of expression recognition can be further improved by introducing the action characteristics of the eyeball.
In step S40, the second expression sequence is scored using the speech features and the gesture motion features. The scoring process can effectively screen out possible false identifications, so that the accuracy of expression identification is improved.
In the step S50, the expressions with scores lower than the scoring threshold value in the second expression sequence are recognized by utilizing a pre-trained multi-mode expression recognition model, and the second expression sequence is updated by utilizing the recognized expressions, so that a third expression sequence is obtained. The process can effectively correct possible false recognition, and further improves the accuracy of expression recognition.
In step S60, the third expression sequence is subjected to single-expression and complex-expression division to form an expression segment sequence including at least one expression segment. The method can better understand and analyze the expression change of the identification object, provide more detailed and rich expression information, and describe the expression by using the expression segment sequence, thereby realizing the analysis of the dynamic process of the expression and being beneficial to better capturing the emotion change.
Finally, in step S70, the expression segment sequence is output. The method can intuitively present the expression change of the identification object, and is convenient for subsequent analysis and utilization.
In general, the multi-modal expression recognition method disclosed by the invention realizes accurate recognition of expressions by fusing various modal characteristics, can solve the technical problems that in the prior art, the dynamic process of the emotion is ignored, the change of the emotion cannot be accurately captured, and the complex emotion state cannot be processed, has higher recognition accuracy and efficiency, and has stronger practical value and wide application prospect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a multi-modal expression recognition method provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, wherein the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any particular number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
In addition, in the detailed description of the present invention, formulas and pseudo codes are used, and temporary variables for describing sequence numbers, such as i, j, k, are used in the formulas or pseudo codes, and are all intermediate variables in the formulas or pseudo code calculation or operation process.
As shown in fig. 1, a flowchart of a method for identifying a multi-modal expression according to a first aspect of the present invention is provided, the method includes the following steps:
s10, acquiring multi-modal characteristics of an identification object from video, wherein the multi-modal characteristics comprise facial action characteristics, eyeball action characteristics, voice characteristics and gesture action characteristics;
s20, acquiring a plurality of expressions according to facial action characteristics to form a first expression sequence;
s30, correcting the first expression sequence by utilizing eyeball action characteristics to obtain a second expression sequence;
s40, scoring the second expression sequence by utilizing the voice characteristics and the gesture action characteristics;
s50, recognizing the expression with the score lower than a score threshold in the second expression sequence by using a pre-trained multi-mode expression recognition model, and updating the second expression sequence by using the recognized expression to obtain a third expression sequence;
s60, carrying out single expression and complex expression division on the third expression sequence to form an expression segment sequence containing at least one expression segment;
S70, outputting the expression segment sequence.
In the above technical solution, the facial motion feature, the eyeball motion feature and the gesture motion feature are obtained by:
performing face recognition, eyeball recognition and gesture recognition on each frame of the video to obtain a facial feature point set describing facial features, an eyeball feature point set describing eyeball features and a gesture feature point set describing gesture features;
and establishing a facial motion feature matrix, an eyeball motion feature matrix and a gesture motion feature matrix according to the time axis by using the facial feature point set, the eyeball feature point set and the gesture feature point set corresponding to each frame as facial motion features, eyeball motion features and gesture motion features.
In an embodiment of step S10, the video is composed of a plurality of frame images, each of which is treated as a separate processing unit; each frame first needs to be preprocessed, including noise reduction, filtering and enhancement.
For each frame, face recognition is first performed, which typically involves face detection and feature extraction. Each frame can be processed using an existing face recognition algorithm, such as a deep-learning convolutional neural network (CNN) model or the Dlib library, to obtain a set of facial feature points describing the positions and shapes of key parts such as the facial contour, eyes, nose and mouth.
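As a non-limiting illustration, the per-frame facial feature point set could be obtained with the Dlib library roughly as in the sketch below; the predictor file name and the frame source are assumptions, not part of the invention.

import dlib
import cv2

# Assumed path; the 68-point landmark predictor file must be obtained separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def facial_feature_points(frame_bgr):
    """Return the facial feature point set [(x, y), ...] of one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return []                        # no face detected in this frame
    shape = predictor(gray, faces[0])    # landmarks of the first detected face
    return [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]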
Then, extraction of the eye movement features is performed, which involves detection of the eye and positioning of the pupil. The motion track of the eyeball can be obtained by utilizing the existing eyeball tracking algorithm to form the eyeball motion characteristics.
Then, the voice features are extracted. This requires extracting the audio signal from the video and then extracting the features of the speech using speech processing techniques such as MFCC (Mel Frequency Cepstral Coefficients) and the like.
Finally, gesture motion features are extracted. This generally involves detecting the human pose and locating its key points. Each frame may be processed using an existing pose estimation algorithm, such as OpenPose, to obtain a set of feature points describing the pose.
Through the steps, facial motion features, eyeball motion features, voice features and gesture motion features can be obtained from each frame, and then the features are organized according to time sequence to form multi-modal features.
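The organisation of the per-frame feature point sets into motion feature matrices along the time axis can be sketched as follows; this is a minimal illustration that assumes each frame yields a point set of the same size, and the variable names are illustrative only.

import numpy as np

def build_motion_feature_matrix(per_frame_points):
    """Stack per-frame feature point sets into a (num_frames, num_points * 2) matrix."""
    rows = [np.asarray(points, dtype=float).reshape(-1) for points in per_frame_points]
    return np.vstack(rows)

# face_points_per_frame, eye_points_per_frame, pose_points_per_frame are lists
# (one entry per frame) produced by the recognisers described above, e.g.:
# facial_motion = build_motion_feature_matrix(face_points_per_frame)
# eye_motion    = build_motion_feature_matrix(eye_points_per_frame)
# pose_motion   = build_motion_feature_matrix(pose_points_per_frame)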
In another embodiment of the invention, it is further necessary to consider whether the voice in the video comes from the recognition object, as well as the time offset between the voice and the image, so as to account for the misalignment between voice and image caused by distance.
In the above technical solution, the voice feature acquiring manner is as follows:
extracting an audio signal from a video;
preprocessing the extracted audio signal, including noise reduction, gain and other operations, so as to improve the quality of the audio signal;
performing feature extraction on the preprocessed audio signal through a voice feature extractor to obtain basic voice features of the audio signal;
identifying the mouth shape in the video to obtain mouth shape characteristics;
training a neural network, and obtaining voice characteristics by performing mouth shape voice synchronous processing on mouth shape characteristics and basic voice characteristics.
Specifically, the embodiment of acquiring the speech feature is as follows:
extracting an audio signal from a video: first, an audio signal is extracted from a video using an audio extraction tool. This step may use various existing audio extraction software or tools, such as ffmpeg, etc.
Preprocessing the extracted audio signal: after the audio signal is extracted, it is preprocessed to improve its quality, including noise reduction and gain operations. Noise reduction may use various algorithms such as spectral subtraction or Wiener filtering; the purpose of the gain is to adjust the amplitude of the audio signal so that it is more easily processed in subsequent steps.
The preprocessed audio signal is subjected to feature extraction through a voice feature extractor: after preprocessing, a voice feature extractor is used for extracting features of the audio signal, so as to obtain basic voice features of the audio signal. The voice features comprise syllables, intonation, pitch, intensity and the like, and the feature extraction can use traditional feature extraction methods such as MFCC, LPCC and the like, and can also use deep learning and the like.
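For instance, the basic speech features could be extracted with the librosa library as in the sketch below; the file path, sampling rate and MFCC order are assumptions used only for illustration.

import librosa

def basic_speech_features(audio_path, sr=16000, n_mfcc=13):
    """Extract MFCC features of shape (n_frames, n_mfcc) from the audio extracted from the video."""
    signal, sr = librosa.load(audio_path, sr=sr)       # load and resample the audio signal
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                                       # one feature row per audio frame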
Identifying the mouth shape in the video to obtain mouth shape characteristics: and simultaneously, recognizing the mouth shape in the video to obtain mouth shape characteristics. The mouth shape recognition may use image processing methods such as edge detection, region growing, etc., and deep learning methods such as convolutional neural networks, etc.
Training a neural network and obtaining the voice features by synchronizing the mouth shape features with the basic voice features: finally, a neural network is trained, with the mouth shape features and basic voice features as input and the synchronized voice features as output. The neural network may be trained using a back-propagation algorithm, and the optimization algorithm may be stochastic gradient descent, Adam, etc. In this way, the problem of voice-image misalignment caused by distance can be solved.
In the present invention, a multi-layer perceptron (MLP) will be used as the neural network model. An MLP is a feed forward neural network that includes an input layer, one or more hidden layers, and an output layer.
During training of the neural network, training samples are prepared. The training samples are a large amount of pre-collected data containing both mouth shape features and basic voice features, used to train the neural network. In the present invention, training samples may be obtained from video: the mouth shape features are obtained by recognizing the mouth shape in the video, and the voice features are obtained by processing the audio signal extracted from the video.
Specifically, the input layer of the neural network receives as inputs the mouth shape features and the basic speech features. These features may be a series of numbers, each representing a particular feature. For example, the mouth shape features may include width, height, etc. of lips, and the voice features may include frequency, amplitude, etc. of the audio signal.
The hidden layer is the core of the neural network, and is responsible for processing input data and transmitting the processing result to the output layer. In the hidden layer, each neuron performs a weighted summation of the data it receives and a nonlinear transformation by an activation function.
The output layer receives the output of the hidden layer and converts it into the desired format. In the present invention, the output layer will output a speech feature vector.
Through the training process, the neural network can learn the relation between the mouth shape characteristics and the basic voice characteristics, so that the unknown mouth shape and voice can be accurately predicted.
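A minimal sketch of such an MLP in PyTorch is given below; the layer sizes, the way the mouth shape and speech features are concatenated, and the optimizer settings are assumptions for illustration only.

import torch
import torch.nn as nn

class MouthSpeechSyncMLP(nn.Module):
    """Feed-forward network mapping concatenated mouth shape + basic speech features
    to a synchronized speech feature vector."""
    def __init__(self, mouth_dim, speech_dim, hidden_dim=128, out_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mouth_dim + speech_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, mouth_feat, speech_feat):
        x = torch.cat([mouth_feat, speech_feat], dim=-1)  # fuse the two inputs
        return self.net(x)

# Training would use back-propagation with an optimizer such as Adam, e.g.:
# model = MouthSpeechSyncMLP(mouth_dim=8, speech_dim=13)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)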
Further, in the above technical solution, the step of obtaining a plurality of expressions according to facial motion features to form a first expression sequence specifically includes: and recognizing the expression of each frame according to the facial feature point set corresponding to each frame, wherein the expression of each frame comprises expression expressive degrees, and forming a first expression sequence according to a time axis by the obtained expression of each frame.
In step S20, expression features will be extracted from the facial feature point set of each frame using a deep learning technique, particularly a convolutional neural network (Convolutional Neural Networks, CNN), and expression recognition will be performed. The specific implementation mode is as follows:
first, a facial feature point set is input into a pre-trained CNN model. The CNN model is trained from a large number of labeled face images, which has learned how to extract useful expressive features from a set of facial feature points. The structure of the CNN model typically includes multiple convolution layers, pooling layers, and full connection layers. The convolution layer and the pooling layer are responsible for extracting local features with different scales, and the full connection layer integrates the features for final classification.
In the expression recognition process, the facial Feature point set is first converted into Feature map (Feature Maps) by a convolution layer. And then, feature dimension reduction is carried out by using a pooling layer, so that the calculation efficiency and the robustness of the model are improved. Pooling operations typically have both maximum Pooling (Max Pooling) and Average Pooling (Average Pooling), where maximum Pooling is employed; by repeated convolution and pooling operations, the original facial feature point set can be converted into a high-level abstract feature representation, which can effectively capture the expression information of the face. And finally, classifying the extracted features by using a full connection layer to obtain the expression of each frame.
After the expression of each frame is recognized, its expressiveness needs to be calculated. Expressiveness may be understood as the intensity or clarity of an expression, reflecting how clearly the expression is displayed. The expressiveness of each frame can be calculated using a softmax function, which maps any real numbers into the (0, 1) interval so that the output can be interpreted as a probability. The softmax function may be expressed as:
p_i = exp(z_i) / Σ_j exp(z_j)

where z_i is the output of the fully connected layer for the i-th expression and p_i is the expressiveness of the i-th expression.
Finally, the expression of each frame, together with its expressiveness, is arranged in time order to form the first expression sequence.
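As an illustration of how the expression label and its expressiveness can be read off the network output, a minimal sketch follows; it assumes the CNN described above is available as a PyTorch module (here hypothetically named expression_cnn) that accepts the per-frame feature point tensor and returns one logit per expression class.

import torch
import torch.nn.functional as F

def classify_frame(expression_cnn, feature_points):
    """Return (expression index, expressiveness) for one frame's facial feature point set."""
    x = torch.as_tensor(feature_points, dtype=torch.float32).unsqueeze(0)  # add a batch dimension
    logits = expression_cnn(x)                  # outputs z of the fully connected layer
    probs = F.softmax(logits, dim=-1)           # p_i = exp(z_i) / sum_j exp(z_j)
    expressiveness, label = probs.max(dim=-1)   # most likely expression and its probability
    return label.item(), expressiveness.item()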
In this embodiment, the softmax function is used to calculate the expressiveness of each frame. In another embodiment of the present invention, the expressiveness may also be obtained by calculating the change in distance between feature points. Assume that the facial feature point set is P = {p_1, p_2, ..., p_n}, where p_i = (x_i, y_i) denotes the coordinates of the i-th feature point. The expressiveness can be defined as the rate of change of the distances between feature points, namely:

d_ij = ( ||p_i - p_j|| - ||p_i^0 - p_j^0|| ) / ||p_i^0 - p_j^0||

where p_i^0 and p_j^0 denote the coordinates of the i-th and j-th feature points in the reference frame (e.g., the first frame), and d_ij denotes the rate of change of the distance between the i-th and j-th feature points; the larger its value, the greater the degree of expression change.
The expression of each frame may then be divided into different categories, such as happy, sad, angry and surprised, by a clustering algorithm. For each frame, its distance to each category may be calculated, and the frame is assigned to the closest category, namely:

c_t = argmin_k || f_t - μ_k ||

where c_t denotes the expression category of the t-th frame, k indexes the expression categories, f_t denotes the feature vector of the t-th frame, and μ_k denotes the mean of the features of the k-th expression category.
Through the above steps, the expression category and expressiveness of each frame can be obtained, and the first expression sequence is then formed along the time axis. In order to reduce the influence of noise, a sliding-window method may be used to smooth the expression sequence, that is:

s_t = argmax_c Σ_{j = t-w}^{t+w} 1(c_j = c)

where s_t denotes the smoothed expression category of the t-th frame, w denotes the size of the sliding window, 1(·) denotes the indicator function, whose value is 1 when the condition in brackets is satisfied and 0 otherwise, j is a temporary index running from t-w to t+w, and c ranges over the candidate expression categories.
The smoothed expression categories, together with the per-frame expressiveness, constitute the final first expression sequence along the time axis.
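A compact sketch of this alternative embodiment (distance-change expressiveness, nearest-centroid categorisation and sliding-window smoothing) is shown below; the function and variable names are illustrative assumptions.

import numpy as np
from collections import Counter

def expressiveness(points, reference_points):
    """Mean rate of change of pairwise distances with respect to the reference frame."""
    p, p0 = np.asarray(points, float), np.asarray(reference_points, float)
    d = np.linalg.norm(p[:, None] - p[None, :], axis=-1)      # current pairwise distances
    d0 = np.linalg.norm(p0[:, None] - p0[None, :], axis=-1)   # reference pairwise distances
    mask = d0 > 0                                             # ignore zero reference distances
    return float(np.mean(np.abs(d[mask] - d0[mask]) / d0[mask]))

def nearest_category(frame_feature, category_means):
    """Index of the category mean closest to the frame feature."""
    dists = [np.linalg.norm(frame_feature - mu) for mu in category_means]
    return int(np.argmin(dists))

def smooth_labels(labels, w=2):
    """Sliding-window majority vote over labels[t-w : t+w+1]."""
    smoothed = []
    for t in range(len(labels)):
        window = labels[max(0, t - w): t + w + 1]
        smoothed.append(Counter(window).most_common(1)[0][0])
    return smoothed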
In the above technical solution, the step of correcting the first expression sequence by using the eyeball action feature to obtain the second expression sequence specifically includes:
step 1: quantifying eyeball motion characteristics;
step 2: establishing a corresponding relation between eyeball action characteristics and expressions;
step 3: and correcting the first expression sequence to obtain a second expression sequence.
The specific embodiment of step S30 is as follows:
In the expression recognition process, the eyeball motion features are a very important part, because people's eye movements are closely related to their expressions. For example, when a person is surprised or afraid, the eyes open wider and the eyeballs tend to look upward; when a person feels tired or bored, the eyes narrow and the eyeballs tend to look downward. Therefore, by using the eyeball motion features, the first expression sequence can be corrected to obtain a more accurate second expression sequence.
Step 1: quantifying eyeball motion characteristics
The eyeball motion features first need to be quantified, that is, the eye movement is converted into numerical form. Specifically, the direction and amplitude of the eye movement are taken as the main quantization parameters.
(1) Calculating the direction of eye movement
We can calculate the direction of eye movement from the change in the set of eye feature points. The specific calculation method is that for each frame, the mass center of the eyeball characteristic point is calculated, and then the movement direction of the mass center between two continuous frames is calculated. This can be expressed by the following formula:
θ_t = arctan( (y_t - y_{t-1}) / (x_t - x_{t-1}) )

where θ_t denotes the eye-movement direction of the t-th frame, and x_t and y_t denote respectively the x-coordinate and y-coordinate of the centroid of the eyeball feature points in the t-th frame.
(2) Calculating the amplitude of eye movement
We can calculate the magnitude of eye movement from the change in the set of eye feature points. The specific calculation method is that for each frame, a covariance matrix of the eyeball characteristic point set is calculated, and then the sum of characteristic values of the covariance matrix is calculated, wherein the values can reflect the amplitude of eyeball motion. This can be expressed by the following formula:
A_t = Σ_k λ_k

where A_t denotes the eye-movement amplitude of the t-th frame and λ_k denotes the k-th eigenvalue of the covariance matrix of the eyeball feature point set of the t-th frame.
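The two quantities above could be computed, for example, as in the following sketch, where eye_points_prev and eye_points_t denote the eyeball feature point sets of two consecutive frames; the names are illustrative.

import numpy as np

def eye_movement_direction(eye_points_prev, eye_points_t):
    """Direction (radians) of the centroid motion between two consecutive frames."""
    c_prev = np.mean(np.asarray(eye_points_prev, float), axis=0)
    c_t = np.mean(np.asarray(eye_points_t, float), axis=0)
    dx, dy = c_t - c_prev
    return float(np.arctan2(dy, dx))

def eye_movement_amplitude(eye_points_t):
    """Sum of eigenvalues of the covariance matrix of the eyeball feature points."""
    pts = np.asarray(eye_points_t, float)
    cov = np.cov(pts, rowvar=False)            # 2x2 covariance of the point coordinates
    return float(np.sum(np.linalg.eigvalsh(cov)))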
Step 2: establishing a corresponding relation between eyeball action characteristics and expressions
A correspondence between the eyeball motion features and the expressions then needs to be established. A threshold can be set for the eyeball motion features: when the eyeball motion features exceed this threshold, the expression is considered to be present. The threshold can be determined by a machine learning method: a classifier is trained on a large amount of training data, with eyeball motion features as input and expressions as output, and the threshold is then determined by cross-validation.
Step 3: correcting the first expression sequence
We need to make corrections to the first expression sequence. Specifically, for each expression in the first expression sequence, we calculate the corresponding eyeball motion feature, then judge whether the expression exists or not through the threshold value, and if not, we delete the expression from the first expression sequence to obtain the second expression sequence.
Through the steps, the first expression sequence can be corrected by utilizing the eyeball action characteristics, and the second expression sequence is obtained.
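The correction itself can be sketched as a simple filtering step; the threshold value and the data layout below are assumptions.

def correct_with_eye_motion(first_sequence, eye_amplitudes, threshold=0.5):
    """Keep only expressions whose corresponding eye-movement amplitude reaches the threshold.

    first_sequence: list of (frame_index, expression) pairs
    eye_amplitudes: dict mapping frame_index -> eye-movement amplitude
    """
    second_sequence = []
    for frame_index, expression in first_sequence:
        if eye_amplitudes.get(frame_index, 0.0) >= threshold:
            second_sequence.append((frame_index, expression))
    return second_sequence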
In the above technical solution, the step of scoring the second expression sequence by using the voice feature and the gesture motion feature specifically includes:
extracting the corresponding voice feature and gesture action feature of each expression in the second expression sequence, and directly setting the score of the corresponding expression as the highest score if the voice feature or gesture action feature does not exist; if the voice feature and the gesture motion feature exist at the same time, the method comprises the following steps:
scoring the voice features by using a voice emotion scoring model to obtain voice emotion scores;
scoring the gesture action features by using a gesture emotion scoring model to obtain gesture emotion scores;
And calculating expression scores by using a weighted average method.
The implementation of step S40 mainly includes scoring the second expression sequence using the speech feature and the gesture motion feature. The aim of the step is to score the expression sequence by comprehensively considering the multi-modal characteristics, so that the expression can be accurately identified and judged.
First, the speech features are scored using a speech emotion scoring model. The model is obtained through supervised learning, and its training data is a manually labelled speech emotion data set. In this model, speech features are taken as input and speech emotion labels as output. The speech emotion scoring model can be expressed as a function f_v(·): for each speech feature v, a speech emotion score S_v = f_v(v) is obtained.

Secondly, the gesture motion features are scored using a gesture emotion scoring model. This model is likewise obtained through supervised learning, and its training data is a manually labelled gesture emotion data set. In this model, gesture motion features are taken as input and gesture emotion labels as output. The gesture emotion scoring model can be expressed as a function f_g(·): for each gesture motion feature g, a gesture emotion score S_g = f_g(g) is obtained.

Then, the speech emotion score and the gesture emotion score are combined to obtain the final expression score. If the speech feature or the gesture motion feature does not exist, the score of the corresponding expression is directly set to the highest score; if both the speech feature and the gesture motion feature exist, a weighted average is used, namely

S = w_v · S_v + w_g · S_g

where w_v and w_g are weights that adjust the influence of the speech emotion score and the gesture emotion score. These two weights need to be tuned according to the actual situation; for example, if speech emotion is considered more important than gesture emotion, then w_v > w_g.

Finally, an expression score S is obtained for each expression in the second expression sequence, which completes the scoring of the second expression sequence.
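As a small illustration of this scoring rule, a sketch follows; f_v and f_g stand for the two pre-trained scoring models described above, and the weight values are example assumptions.

def score_expression(voice_feat, gesture_feat, f_v, f_g, w_v=0.6, w_g=0.4, max_score=1.0):
    """Combine speech and gesture emotion scores into one expression score."""
    if voice_feat is None or gesture_feat is None:
        return max_score               # no speech or gesture evidence: keep the expression
    s_v = f_v(voice_feat)              # speech emotion score
    s_g = f_g(gesture_feat)            # gesture emotion score
    return w_v * s_v + w_g * s_g       # weighted average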
In the above technical solution, the step of identifying the expression with the score lower than the score threshold in the second expression sequence by using a pre-trained multi-mode expression identification model, and updating the second expression sequence by using the identified expression to obtain the third expression sequence specifically includes:
Step 1, defining a scoring threshold value;
step 2, traversing each expression in the second expression sequence to obtain expressions with all scores smaller than a score threshold;
Step 3, recognizing the expression obtained in step 2 by using a pre-trained multi-modal expression recognition model to obtain a new expression, setting the score of the new expression to 1, and updating the second expression sequence to obtain a third expression sequence.
In step S50, the expression with the score lower than the score threshold in the second expression sequence is identified by using a pre-trained multi-mode expression identification model, and the second expression sequence is updated by using the identified expression, so as to obtain a third expression sequence. The implementation of this step mainly comprises the following steps:
First, a scoring threshold T is defined. This is a preset value used to judge the recognition quality of an expression: if the score of an expression is lower than T, the recognition of that expression is considered unreliable and it needs to be re-recognized by the pre-trained multi-modal expression recognition model.

Then, each expression in the second expression sequence is traversed; if the score of an expression is lower than T, the expression is input into the multi-modal expression recognition model for recognition. The multi-modal expression recognition model is a deep learning model such as a convolutional neural network (CNN) or a recurrent neural network (RNN). The training process of this model comprises the following steps:
First, a large amount of multimodal expression data is collected, including facial motion features, eye motion features, voice features, and gesture motion features, and corresponding expression tags. The data may be obtained from the internet or may be collected by laboratory equipment.
These data are then separated into training and test sets. The training set is used to train the model and the test set is used to evaluate the performance of the model.
Next, the structure of the model is defined. For example, if CNN is used, multiple convolution layers, pooling layers, and full connection layers may be defined; if RNN is used, multiple loop layers and full connection layers may be defined.
Then, a loss function and an optimizer of the model are defined. The loss function is used to measure the gap between the predicted and actual results of the model, and the optimizer is used to adjust the parameters of the model to minimize the loss function.
Next, data of the training set is input into the model, prediction results of the model are calculated by forward propagation, and then parameters of the model are updated by backward propagation.
And finally, inputting the data of the test set into the model, calculating a prediction result of the model, and then calculating the accuracy of the model on the test set so as to evaluate the performance of the model.
After the multi-modal expression recognition model is obtained, the expressions in the second expression sequence whose scores are lower than T can be input into the model for recognition. The specific recognition process comprises the following steps:
first, facial motion features, eyeball motion features, voice features, and gesture motion features of an expression are input into a model.
Then, the prediction result of the model is calculated by forward propagation. The outcome of this prediction is a probability distribution representing the probability that the model predicts the expression as each possible expression.
And finally, selecting the expression with the highest probability as a prediction result of the model.
And taking the predicted result of the model as a new recognition result of the expression, setting the score of the new recognition result as the highest score, and then updating the second expression sequence to obtain a third expression sequence.
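Putting the re-recognition step together, a hedged sketch follows; recognize_multimodal stands for the forward pass of the pre-trained model and is assumed to return a dictionary mapping expression labels to probabilities, and all other names and the threshold value are illustrative.

def refine_low_score_expressions(second_sequence, scores, features, recognize_multimodal, threshold=0.6):
    """Re-recognize expressions scoring below the threshold and return the third sequence.

    second_sequence: list of expressions; scores: parallel list of scores;
    features: parallel list of (face, eye, voice, gesture) feature tuples.
    """
    third_sequence = list(second_sequence)
    for i, score in enumerate(scores):
        if score < threshold:
            face, eye, voice, gesture = features[i]
            probs = recognize_multimodal(face, eye, voice, gesture)  # probability per expression
            third_sequence[i] = max(probs, key=probs.get)            # most probable expression
            scores[i] = 1.0                                          # new result gets the highest score
    return third_sequence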
In the above technical solution, the step of dividing the third expression sequence into a single expression and a complex expression to form an expression segment sequence including at least one expression segment specifically includes:
firstly, preprocessing a third expression sequence;
then, carrying out single expression and complex expression division on the pretreated expression sequence;
and finally, combining the divided single expression and the complex expression into an expression segment sequence.
The specific embodiment of this step is described below:
first, the third expression sequence is preprocessed. The purpose of the preprocessing is to reduce noise and unnecessary interference so as to more accurately perform the single expression and complex expression division. The pretreatment mainly comprises the following two steps:
smoothing: and smoothing the third expression sequence to reduce expression fluctuation caused by recognition errors or other non-expression factors. Common smoothing methods include a moving average method and a median filter method. Preferably, the smoothing process may be performed using a moving average method, and the specific formula is:
e'_t = (1 / (2w + 1)) · Σ_{j = t-w}^{t+w} e_j

where e'_t denotes the t-th expression in the smoothed expression sequence, e_j denotes the j-th expression in the original third expression sequence, and w denotes the size of the smoothing window.
Extremum removal: extremum removal is carried out on the smoothed expression sequence to reduce extreme expressions caused by expression recognition errors or other non-expression factors. A threshold δ can be set, and expressions whose values are greater than δ or less than -δ are removed.
And then, carrying out single expression and complex expression division on the preprocessed expression sequence. The specific method comprises the following steps:
single expression division: can set an expression change threshold Then the expression change is less than +.>Dividing the continuous expression segment of (a) into a single expression, and changing the expression more than +.>Is divided into a complex expression.
And finally, combining the divided single expression and the complex expression into an expression segment sequence.
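The division into single and complex expression segments may be sketched as below, with eps standing for the assumed expression-change threshold ε and the per-frame change values taken from the preprocessed sequence.

def divide_expression_segments(changes, eps=0.3):
    """Split per-frame expression-change values into ('single' | 'complex', start, end) segments."""
    segments, start = [], 0
    for t in range(1, len(changes) + 1):
        # a boundary is reached at the end of the sequence or when the single/complex status flips
        if t == len(changes) or (changes[t] < eps) != (changes[start] < eps):
            kind = "single" if changes[start] < eps else "complex"
            segments.append((kind, start, t - 1))
            start = t
    return segments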
The step S70 is implemented as follows:
in this step, the expression segment sequence is output. The expression segment sequence here is a sequence comprising at least one expression segment, each expression segment comprising one or more consecutive frames, each frame having a corresponding expression label.
The sequence of expression segments may be output as a file, each row containing information of one expression segment including a start time, an end time and an expression tag of the expression segment. The sequence of expression segments can also be output as a graphical user interface, and the user can view the expression labels of different time periods by sliding the scroll bar.
This is a specific embodiment of step S70. In this way, the results of the expression recognition can be presented to the user in an intuitive manner, helping the user to better understand the recognition results.
The pseudo code implementing step S70 is as follows:
def output_expression_segment_sequence(E):
    # E: per-frame expression label sequence; returns a list of (start, end, label) segments.
    S = []   # expression segment sequence
    j = 0    # start position of the current expression segment
    for i in range(len(E) - 1):
        if E[i] != E[i + 1]:
            S.append((j, i, E[i]))   # close the segment that ends at frame i
            j = i + 1                # the next segment starts at frame i + 1
    S.append((j, len(E) - 1, E[j]))  # append the final segment
    return S
In this pseudo code, an empty expression segment sequence S is first created, and the start position j of the current expression segment is initialized to 0. Each position i of the expression label sequence E is then traversed. If the expression label at position i differs from the label at position i + 1, an expression segment has been found, running from position j to position i with label E[i]; this segment is appended to the expression segment sequence S, and j is updated to i + 1. Finally, the segment running from position j to the end of the expression label sequence E is appended to S.
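For instance, applied to a short, purely illustrative label sequence:

segments = output_expression_segment_sequence(["happy", "happy", "surprised", "surprised", "sad"])
# segments == [(0, 1, "happy"), (2, 3, "surprised"), (4, 4, "sad")]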
A second aspect of the present invention provides a computer readable storage medium having stored therein program instructions which, when executed, are adapted to carry out a multimodal expression recognition method as described above.
A third aspect of the present invention provides a multimodal expression recognition system, comprising the computer readable storage medium described above.
Specifically, the principle of the invention is as follows: the invention provides a multi-modal expression recognition method, which mainly utilizes facial motion characteristics, eyeball motion characteristics, voice characteristics and gesture motion characteristics to effectively recognize the expression of a recognition object.
First, a multi-modal feature of an identification object is obtained from a video. Specifically, face recognition, eyeball recognition and gesture recognition are performed on each frame of the video to obtain a facial feature point set describing facial features, an eyeball feature point set describing eyeball features and a gesture feature point set describing gesture features. Then, a facial feature point set, an eyeball feature point set and a gesture feature point set corresponding to each frame are used for establishing a facial motion feature matrix, an eyeball motion feature matrix and a gesture motion feature matrix according to a time axis to serve as facial motion features, eyeball motion features and gesture motion features. The characteristics can reflect dynamic expression changes of the recognition object, and are beneficial to improving the accuracy of expression recognition.
Secondly, for the acquisition of voice characteristics, the problem that the voice images are not aligned due to distance needs to be considered, and the voice image acquisition method is aligned in a mouth shape and voice matching mode. The specific acquisition mode is as follows: extracting an audio signal from a video; preprocessing the extracted audio signal, including noise reduction, gain and other operations, so as to improve the quality of the audio signal; performing feature extraction on the preprocessed audio signal through a voice feature extractor to obtain basic voice features of the audio signal; identifying the mouth shape in the video to obtain mouth shape characteristics; training a neural network, and obtaining voice characteristics by performing mouth shape voice synchronous processing on mouth shape characteristics and basic voice characteristics.
And then, acquiring a plurality of expressions according to facial action characteristics to form a first expression sequence. And correcting the first expression sequence by utilizing eyeball action characteristics to obtain a second expression sequence. Eye movements have important effects on expression recognition, for example, the degree of eye closure, the speed and direction of eye movement, etc. may affect expression recognition results.
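A highly simplified sketch of the eyeball-based correction is given below; the quantization rule, the confidence threshold and the eye-state-to-expression mapping are all illustrative assumptions rather than the patent's exact correspondence.

def correct_with_eye_features(first_sequence, eye_features, eye_to_expression):
    # first_sequence: list of (expression_label, confidence) per frame.
    # eye_features: list of quantified eyeball descriptors per frame,
    #               e.g. {"closure": 0.9, "speed": 0.1}.
    # eye_to_expression: mapping from a coarse eye state to the expression it
    #                    suggests, e.g. {"closed_slow": "tired"} (illustrative).
    second_sequence = []
    for (label, conf), eye in zip(first_sequence, eye_features):
        state = "closed_slow" if eye["closure"] > 0.8 and eye["speed"] < 0.2 else "other"
        suggested = eye_to_expression.get(state)
        # Only low-confidence facial predictions that conflict with the eye cue
        # are corrected; everything else is kept unchanged.
        if suggested is not None and conf < 0.5 and suggested != label:
            second_sequence.append((suggested, conf))
        else:
            second_sequence.append((label, conf))
    return second_sequence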
And then scoring the second expression sequence by utilizing the voice characteristic and the gesture action characteristic. The voice features and gesture motion features can provide additional contextual information that helps to more accurately identify expressions. For example, when a recognition object is speaking, its speech characteristics may affect the recognition of expressions; when the recognition object performs certain actions, the gesture action characteristics of the recognition object may influence the recognition of the expression.
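The scoring rule can be sketched as follows, with the two sub-scores combined by a weighted average; the equal weights and the 0-1 score range are assumptions, and voice_scorer / gesture_scorer stand for the pre-trained emotion scoring models mentioned in the claims.

def score_expression(voice_feat, gesture_feat, voice_scorer, gesture_scorer,
                     w_voice=0.5, w_gesture=0.5, highest_score=1.0):
    # If either modality is missing, the expression directly receives the
    # highest score, as described above.
    if voice_feat is None or gesture_feat is None:
        return highest_score
    voice_score = voice_scorer(voice_feat)        # voice emotion score in [0, 1]
    gesture_score = gesture_scorer(gesture_feat)  # gesture emotion score in [0, 1]
    # Weighted average of the two modality scores.
    return (w_voice * voice_score + w_gesture * gesture_score) / (w_voice + w_gesture)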
And then, recognizing the expression with the score lower than the score threshold value in the second expression sequence by using a pre-trained multi-mode expression recognition model, and updating the second expression sequence by using the recognized expression to obtain a third expression sequence. This step can further improve the accuracy of expression recognition.
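A sketch of this refinement step is shown below; the threshold value, the predict interface and the fused per-frame features are placeholders, since the patent does not prescribe a concrete model API.

def refine_low_score_expressions(second_sequence, scores, multimodal_model,
                                 fused_features, score_threshold=0.6):
    # second_sequence: list of expression labels; scores: one score per label.
    # multimodal_model: pre-trained multi-modal expression recognition model
    # (assumed to expose a predict() method); fused_features: the per-frame
    # multi-modal features it expects.
    third_sequence = list(second_sequence)
    for i, score in enumerate(scores):
        if score < score_threshold:
            # Re-recognize only the low-scoring expressions and overwrite them.
            third_sequence[i] = multimodal_model.predict(fused_features[i])
    return third_sequence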
And finally, carrying out single expression and complex expression division on the third expression sequence to form an expression segment sequence containing at least one expression segment, and outputting the expression segment sequence. This step may help the user to better understand the change in expression of the recognition object.
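One possible way to carry out the single/complex division is sketched below; the run-length criterion (a run of at least three identical tags counts as a single expression, shorter runs are merged into complex segments) is an illustrative assumption, not the patent's exact rule.

def divide_single_and_complex(third_sequence, min_run=3):
    # Group the third expression sequence into (start, end, kind, tag) segments,
    # where kind is "single" or "complex".
    segments, start = [], 0
    for i in range(1, len(third_sequence) + 1):
        if i == len(third_sequence) or third_sequence[i] != third_sequence[start]:
            kind = "single" if i - start >= min_run else "complex"
            segments.append((start, i - 1, kind, third_sequence[start]))
            start = i
    # Merge adjacent short runs into one complex-expression segment.
    merged = []
    for seg in segments:
        if merged and merged[-1][2] == "complex" and seg[2] == "complex":
            prev = merged.pop()
            merged.append((prev[0], seg[1], "complex", None))
        else:
            merged.append(seg)
    return merged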
In general, the invention realizes the accurate recognition of the expression by fusing the multi-modal characteristics. Meanwhile, by introducing a scoring mechanism and a multi-modal expression recognition model, the accuracy of expression recognition is further improved.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (10)

1. A multi-modal expression recognition method, characterized by comprising the following steps:
S10, acquiring multi-modal features of a recognition object from a video, wherein the multi-modal features comprise facial motion features, eyeball motion features, voice features and gesture motion features;
S20, acquiring a plurality of expressions according to the facial motion features to form a first expression sequence;
S30, correcting the first expression sequence by utilizing the eyeball motion features to obtain a second expression sequence;
S40, scoring the second expression sequence by utilizing the voice features and the gesture motion features;
S50, recognizing the expressions with scores lower than a score threshold in the second expression sequence by using a pre-trained multi-modal expression recognition model, and updating the second expression sequence with the recognized expressions to obtain a third expression sequence;
S60, performing single-expression and complex-expression division on the third expression sequence to form an expression segment sequence containing at least one expression segment;
S70, outputting the expression segment sequence.
2. The multi-modal expression recognition method according to claim 1, wherein the facial motion features, the eyeball motion features and the gesture motion features are obtained by:
performing face recognition, eyeball recognition and gesture recognition on each frame of the video to obtain a facial feature point set describing facial features, an eyeball feature point set describing eyeball features and a gesture feature point set describing gesture features;
and establishing a facial motion feature matrix, an eyeball motion feature matrix and a gesture motion feature matrix according to the time axis by using the facial feature point set, the eyeball feature point set and the gesture feature point set corresponding to each frame as facial motion features, eyeball motion features and gesture motion features.
3. The multi-modal expression recognition method according to claim 1, wherein the voice features are obtained by:
extracting an audio signal from the video;
preprocessing the extracted audio signal, including noise reduction, gain adjustment and other operations, so as to improve the quality of the audio signal;
performing feature extraction on the preprocessed audio signal through a voice feature extractor to obtain basic voice features of the audio signal;
recognizing the mouth shape in the video to obtain mouth shape features;
training a neural network, and obtaining the voice features by performing mouth-shape and voice synchronization processing on the mouth shape features and the basic voice features.
4. The multi-modal expression recognition method according to claim 2, wherein the step of acquiring a plurality of expressions according to the facial motion features to form a first expression sequence specifically comprises: recognizing the expression of each frame according to the facial feature point set corresponding to each frame, wherein the expression of each frame comprises an expression degree, and forming the first expression sequence from the obtained expressions of the frames according to the time axis.
5. The multi-modal expression recognition method according to claim 1, wherein the step of correcting the first expression sequence by utilizing the eyeball motion features to obtain a second expression sequence specifically comprises:
Step 1: quantifying the eyeball motion features;
Step 2: establishing a correspondence between the eyeball motion features and expressions;
Step 3: correcting the first expression sequence to obtain the second expression sequence.
6. The multi-modal expression recognition method according to claim 1, wherein the step of scoring the second expression sequence by utilizing the voice features and the gesture motion features specifically comprises:
extracting the voice features and gesture motion features corresponding to each expression in the second expression sequence; if the voice features or the gesture motion features do not exist, directly setting the score of the corresponding expression to the highest score; if the voice features and the gesture motion features both exist, performing the following steps:
scoring the voice features by using a voice emotion scoring model to obtain voice emotion scores;
scoring the gesture action features by using a gesture emotion scoring model to obtain gesture emotion scores;
and calculating expression scores by using a weighted average method.
7. The multi-modal expression recognition method according to claim 1, wherein the step of recognizing the expressions with scores lower than the score threshold in the second expression sequence by using a pre-trained multi-modal expression recognition model and updating the second expression sequence with the recognized expressions to obtain a third expression sequence specifically comprises:
Step 1: defining the score threshold;
Step 2: traversing each expression in the second expression sequence to obtain all expressions whose scores are lower than the score threshold;
Step 3: recognizing the expressions obtained in Step 2 by using the pre-trained multi-modal expression recognition model to obtain new expressions, setting the scores of the new expressions to 1, and updating the second expression sequence with them to obtain the third expression sequence.
8. The multi-modal expression recognition method according to claim 1, wherein the step of performing single-expression and complex-expression division on the third expression sequence to form an expression segment sequence containing at least one expression segment specifically comprises:
firstly, preprocessing a third expression sequence;
then, carrying out single expression and complex expression division on the pretreated expression sequence;
and finally, combining the divided single expression and the complex expression into an expression segment sequence.
9. A computer readable storage medium, wherein program instructions are stored in the computer readable storage medium, which program instructions, when executed, are adapted to carry out a multimodal expression recognition method as claimed in any one of claims 1-8.
10. A multimodal expression recognition system comprising the computer readable storage medium of claim 9.
CN202310973040.3A 2023-08-04 2023-08-04 Multi-modal expression recognition method, medium and system Active CN116682168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310973040.3A CN116682168B (en) 2023-08-04 2023-08-04 Multi-modal expression recognition method, medium and system

Publications (2)

Publication Number Publication Date
CN116682168A (en) 2023-09-01
CN116682168B (en) 2023-10-17

Family

ID=87779594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310973040.3A Active CN116682168B (en) 2023-08-04 2023-08-04 Multi-modal expression recognition method, medium and system

Country Status (1)

Country Link
CN (1) CN116682168B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104318221A (en) * 2014-11-05 2015-01-28 中南大学 Facial expression recognition method based on ELM
CN105868694A (en) * 2016-03-24 2016-08-17 中国地质大学(武汉) Dual-mode emotion identification method and system based on facial expression and eyeball movement
US20170160813A1 (en) * 2015-12-07 2017-06-08 Sri International Vpa with integrated object recognition and facial expression recognition
CN107808146A (en) * 2017-11-17 2018-03-16 北京师范大学 A kind of multi-modal emotion recognition sorting technique
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN109801096A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of multi-modal customer satisfaction overall evaluation system, method
CN109815938A (en) * 2019-02-27 2019-05-28 南京邮电大学 Multi-modal affective characteristics recognition methods based on multiclass kernel canonical correlation analysis
CN110969106A (en) * 2019-11-25 2020-04-07 东南大学 Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN111401268A (en) * 2020-03-19 2020-07-10 内蒙古工业大学 Multi-mode emotion recognition method and device for open environment
KR20200085696A (en) * 2018-01-02 2020-07-15 주식회사 제네시스랩 Method of processing video for determining emotion of a person
CN113033450A (en) * 2021-04-02 2021-06-25 山东大学 Multi-mode continuous emotion recognition method, service inference method and system
CN113961063A (en) * 2021-09-01 2022-01-21 泉州市泽锐航科技有限公司 Multi-information fusion man-machine interaction method and system based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
M. PANTIC et al.: "Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences", IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Volume 36, Issue 2, April 2006 *
HONG Huiqun et al.: "A Survey of Expression Recognition Techniques", Journal of Frontiers of Computer Science and Technology, pages 1764-1778 *
WANG Zhitang; CAI Linbo: "Hidden Markov Model (HMM) and Its Applications", Journal of Hunan University of Science and Engineering, No. 04 *

Also Published As

Publication number Publication date
CN116682168B (en) 2023-10-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant