CN113947127A - Multi-mode emotion recognition method and system for accompanying robot - Google Patents
- Publication number
- CN113947127A (application number CN202111079583.8A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- modal
- feature vectors
- extracting
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
- A61B5/372—Analysis of electroencephalograms
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7203—Signal processing specially adapted for physiological signals or for diagnostic purposes for noise prevention, reduction or removal
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention relates to a multi-modal emotion recognition method and system for a companion robot. The method comprises: separately collecting facial expression pictures, voice signals and electroencephalogram signals; extracting emotional feature vectors of the facial expressions, emotional feature vectors of the voice and feature vectors of the electroencephalogram signals; acquiring a weight matrix and multiplying each feature vector by the weight matrix to obtain fusion features; classifying four common emotions, namely happiness, sadness, calmness and disgust, through a support vector machine; and predicting continuous emotion scores by multivariate nonlinear regression over four emotion dimensions: pleasure, tension, excitement and certainty. Compared with the prior art, the method achieves an emotion recognition capability closer to that of human beings through information fusion; dynamically updated weight parameters give the robot the ability to autonomously evolve and continuously adjust its emotion judgment; and the combination of discrete and continuous emotion recognition describes emotion changes more scientifically and deeply.
Description
Technical Field
The invention relates to the technical field of emotion recognition, and in particular to a multi-modal emotion recognition method and system for a companion robot.
Background
Emotion interaction has received great attention in research on natural human-computer interaction. Emotion recognition is the key to human-computer emotion interaction, and its goal is to enable a machine to perceive the emotional state of human beings and improve the humanization level of the machine. Multi-modal emotion recognition technology has broad application prospects and research value in the field of companion robots. Using the various sensors carried by the robot, multi-modal signals containing latent emotional characteristics, such as facial expressions, behaviors, voice and physiological signals, are acquired; features are extracted and fused with deep-learning methods; and human emotion is analyzed and predicted, giving the companion robot stronger emotion recognition and emotion understanding capabilities.
At present, emotion recognition devices suitable for mounting on a companion robot generally analyze, through chips, video systems, audio systems and other subsystems, the changes in physiological characteristics, postures, gestures, intonation and other signals caused by human emotion changes, so as to understand human emotion and present clear and timely responses. Emotion recognition based on facial expressions is usually performed on two-dimensional images: geometric methods based on facial organs and salient facial positions, pixel methods based on facial texture features, and hybrid methods combining the two. Speech-based emotion recognition usually extracts prosodic information and voice-quality features from the speech signal, including Mel-frequency cepstral coefficients (MFCC) and Teager energy operators, and classifies emotions with a support vector machine or a long short-term memory network. For physiological signals, emotion understanding is carried out with traditional machine learning and spiking neural networks, using the frequency bands most relevant to emotion and the temporal stability characteristics of brain regions and the electroencephalogram. Part of current research integrates features from various behavioral and physiological manifestations into a single emotion recognition framework: for example, the mental state corresponding to behavior is inferred from a combination of head movements and facial expressions, which in turn indicates the person's emotional expression. Discrete emotion classification has also been realized by combining expressions and speech signals through a subspace of expression shared across modalities.
Most conventional companion robots lack an emotion recognition function, and robots carrying a specific sensor can only realize simple, single-modality emotion recognition. Techniques based solely on facial expressions or speech signals ignore the complementarity and mutual reinforcement among emotion expressions acquired in different modalities; when the corresponding emotion information is disturbed or insufficiently acquired, recognition efficiency is low and the application requirements of emotion interaction cannot be met. Meanwhile, most existing technical means only identify the external emotional expressions of human beings and overlook the importance of monitoring and detecting physiological signals for emotion recognition. Electroencephalogram and neural signals can accurately, objectively and in real time reflect abnormal emotional and psychological state changes, helping the companion robot perform emotion analysis for different accompanied people and achieve accurate emotional comfort.
In recent years, the emotion recognition modules carried by companion robots have only been able to perform simple preprocessing on the acquired modal information, so data loss and errors often occur. On this basis, multi-modal data fusion is generally performed at the data-set level, i.e., complex and tedious data fusion is carried out without guaranteeing data integrity, which greatly wastes data resources. In addition, most traditional methods adopt a discrete emotion recognition strategy and do not fully consider the continuity and heterogeneity of human emotion changes, so emotion recognition performance is usually poor.
In summary, developing a method based on multi-modal feature acquisition and expression that uses multi-modal emotion characterization data from facial expressions, speech and physiological signals, fully expresses the discriminative power of heterogeneous features, overcomes the difficulties of existing single-modality research, and constructs a multi-modal emotion recognition system suitable for a companion robot has become a problem to be solved by those skilled in this research field.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art by providing a multi-modal emotion recognition method and system for a companion robot that fully express the discriminative power of heterogeneous features using multi-modal emotion characterization data from facial expressions, speech and physiological signals.
The purpose of the invention can be realized by the following technical scheme:
a multi-mode emotion recognition method for a companion robot comprises the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction performs data-fitting training on the fusion features to obtain continuous emotion intensity values over a plurality of emotion dimensions, the emotion dimensions comprising pleasure, tension, excitement and certainty.
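As a sketch of the continuous-emotion prediction step above, the snippet below fits a multivariate nonlinear regression from fused feature vectors to the four dimension scores. The concrete form (a degree-2 polynomial expansion followed by a linear fit) is an assumption, since the patent does not fix one, and all data are random placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical fused features (n_samples x n_features) and continuous
# labels for the four dimensions: pleasure, tension, excitement, certainty.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # stand-in fused multi-modal vectors
Y = rng.uniform(0, 1, size=(200, 4))    # placeholder dimension scores

# Multivariate nonlinear regression: polynomial expansion followed by a
# linear fit, with one continuous output per emotion dimension.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, Y)
scores = model.predict(X[:5])           # continuous emotion intensity values
print(scores.shape)                     # (5, 4)
```

Any regressor that maps the fused vector to four real-valued outputs would fit this step; the pipeline form just keeps the nonlinearity and the fit separable.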
Further, the extracting the emotional feature vector of the facial expression specifically includes:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
Further, the extraction process of the homogeneous pattern LBP algorithm includes:
A 3 × 3 texture region is constructed, with the central pixel value of the region as the threshold. The 8 surrounding pixel values are compared with the threshold: if a pixel value is greater than the threshold, its position is set to 1; if it is less than the threshold, its position is set to 0. Within the 3 × 3 texture region, the values of the 8 neighboring pixels are concatenated clockwise into an 8-bit binary number, and the number of jumps from 0 to 1 or from 1 to 0 in this binary number is counted. If there are at most two jumps, the decimal value of the binary number is the LBP value of the center of the 3 × 3 neighborhood; if there are more than two jumps, the LBP value of the center is set to P + 1 = 9 (with P = 8). Traversing all pixels yields the LBP value of the whole image, and connecting all LBP values in sequence gives the emotional feature vector of the facial expression.
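The uniform-pattern LBP rule above can be sketched as follows. The clockwise neighbour ordering (starting from the top-left) and the circular transition count are assumptions, since the patent does not fix a starting neighbour:

```python
import numpy as np

def uniform_lbp(img):
    """Uniform-pattern LBP as described: 8 clockwise neighbours are
    thresholded against the centre pixel; patterns with at most two
    0/1 transitions keep their decimal value, others map to P + 1 = 9."""
    # clockwise offsets starting at the top-left neighbour (an assumption)
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int64)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            c = img[i, j]
            bits = [1 if img[i + di, j + dj] > c else 0 for di, dj in offs]
            # circular 0->1 / 1->0 transition count (uniform-LBP convention)
            jumps = sum(bits[k] != bits[(k + 1) % 8] for k in range(8))
            if jumps <= 2:                      # uniform pattern
                out[i - 1, j - 1] = int(''.join(map(str, bits)), 2)
            else:                               # non-uniform -> P + 1
                out[i - 1, j - 1] = 9
    return out

img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(uniform_lbp(img))                         # [[30]]
```

For the 3 × 3 example the clockwise bits around the centre 5 are 00011110 (two transitions, hence uniform), whose decimal value 30 becomes the centre's LBP code.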
Further, the extracting the emotion feature vector of the speech specifically includes:
The speech signal is first windowed and smoothed with a Hamming window, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis. A high-pass filter is then designed to eliminate noise from vocal-cord excitation, and MFCC feature extraction is performed. Finally, the Fourier-transformed spectrogram is fed into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
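A minimal sketch of this speech front-end (pre-emphasis high-pass filter, Hamming-windowed framing, FFT into a magnitude spectrogram). The parameters — 16 kHz sampling, 25 ms frames, 10 ms hop, 0.97 pre-emphasis — are conventional assumptions, not values from the patent:

```python
import numpy as np

sr, frame_len, hop = 16000, 400, 160                     # assumed parameters
signal = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)    # 1 s test tone

# High-pass pre-emphasis filter to attenuate low-frequency glottal energy
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# Framing + Hamming window to smooth frame edges before the FFT
n_frames = 1 + (len(emphasized) - frame_len) // hop
idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
frames = emphasized[idx] * np.hamming(frame_len)

# Magnitude spectrogram: the input to MFCC filter banks or a CNN layer
spec = np.abs(np.fft.rfft(frames, n=512, axis=1))
print(spec.shape)                                        # (98, 257)
```

From `spec`, MFCCs would follow by applying a Mel filter bank, a log, and a DCT; the patent additionally feeds the spectrogram to a CNN layer.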
Further, the extracting the feature vector of the electroencephalogram signal specifically includes:
The electroencephalogram signals are first preprocessed and denoised; fractal-dimension features and multi-scale entropy features are then extracted respectively to construct the feature vectors of the electroencephalogram signals.
Further, the preprocessing denoising comprises:
The electroencephalogram signal of the subject is collected at a fixed sampling frequency, a db5 wavelet is selected for multi-layer wavelet decomposition, the wavelet-packet coefficients generated by noise are set to zero using a soft-threshold method, and finally the electroencephalogram signal is reconstructed.
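The soft-threshold rule at the core of this denoising step can be sketched as below. The full pipeline would additionally perform the db5 multi-layer wavelet decomposition and reconstruction (e.g. with PyWavelets' `wavedec`/`waverec`), which is omitted here to keep the sketch dependency-free:

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft-threshold shrinkage: coefficients with |c| <= thr (treated as
    noise) are set to zero, larger ones are shrunk toward zero by thr."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

c = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])   # toy wavelet coefficients
denoised = soft_threshold(c, 1.0)
print(denoised)                              # [-2. -0.  0.  0.5  3.]
```

In the db5 pipeline, this function would be applied to each detail-coefficient array before reconstruction, with `thr` chosen from a noise estimate (e.g. the universal threshold).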
Further, the extraction process of the fractal-dimension features comprises the following steps:
The original sequence is uniformly sub-sampled to obtain K sequences; the change of each element within the K sequences is calculated to construct new sequences; the new sequences are fitted to obtain a slope, and the opposite number of the slope is taken as the initial fractal-dimension (FD) feature. The preprocessed and denoised electroencephalogram signal is then divided into several segments by a sliding window, and the fractal-dimension features are extracted from each segment.
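This procedure matches Higuchi's fractal-dimension estimate; the sketch below is written under that assumption (the maximum scale `kmax` and the exact length normalization are conventional choices the patent does not fix):

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Fractal dimension via the procedure above (Higuchi's method):
    uniformly sub-sample into k interleaved sequences, average the
    normalized absolute element-to-element changes, and negate the
    slope of the log-log fit."""
    x = np.asarray(x, float)
    n = len(x)
    lk = np.empty(kmax)
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):                        # k sub-sampled sequences
            idx = np.arange(m, n, k)
            num = np.abs(np.diff(x[idx])).sum()   # element-to-element change
            norm = (n - 1) / ((len(idx) - 1) * k)
            lm.append(num * norm / k)
        lk[k - 1] = np.mean(lm)
    # slope of log L(k) vs log k; the FD is its opposite number
    slope = np.polyfit(np.log(np.arange(1, kmax + 1)), np.log(lk), 1)[0]
    return -slope

print(higuchi_fd(np.arange(1000.0)))   # a straight line has FD of about 1
```

In the windowed variant described above, `higuchi_fd` would simply be applied to each window of the denoised EEG segment.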
the extraction process of the multi-scale entropy features comprises the following steps:
The multi-scale entropy of the electroencephalogram signal is calculated, the average multi-scale entropy values of the subject's happy and sad emotions are obtained, and the multi-scale entropy at the first one or several scales is then selected as the feature vector of the electroencephalogram signal.
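A sketch of multi-scale entropy as coarse-graining plus sample entropy. The template length m = 2 and tolerance r = 0.2·std are conventional choices assumed here, and the EEG epoch is a random stand-in:

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """Sample entropy with tolerance r * std(x), Chebyshev distance."""
    x = np.asarray(x, float)
    tol = r * x.std()

    def count_matches(mm):
        # N - m templates of length mm; pairwise Chebyshev distances
        t = np.array([x[i:i + mm] for i in range(len(x) - m)])
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        return ((d <= tol).sum() - len(t)) / 2   # exclude self-matches

    return -np.log(count_matches(m + 1) / count_matches(m))

def multiscale_entropy(x, scales=(1, 2)):
    """Coarse-grain the signal at each scale, then take sample entropy."""
    out = []
    for s in scales:
        n = len(x) // s
        coarse = np.asarray(x[:n * s], float).reshape(n, s).mean(axis=1)
        out.append(sample_entropy(coarse))
    return out

rng = np.random.default_rng(1)
sig = rng.normal(size=400)               # stand-in for a denoised EEG epoch
mse = multiscale_entropy(sig, scales=(1, 2))
print(mse)
```

Selecting "the first one or several scales" then amounts to taking the leading entries of `mse` as the EEG feature vector.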
Further, before the classification and recognition, the method further comprises: performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the speech and the feature vectors of the electroencephalogram signals respectively, obtaining the fusion features, and then performing classification and recognition. The classification model adopts an SVM classification model whose kernel function is an RBF kernel. The emotion tags include happy, sad, calm and disgusted.
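The normalization-plus-RBF-SVM stage can be sketched with scikit-learn on synthetic stand-in features; the feature dimension, cluster layout and all data are placeholders, not values from the patent:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical fused feature vectors with the four discrete labels:
# 0 = happy, 1 = sad, 2 = calm, 3 = disgusted.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=2.0 * c, size=(30, 8)) for c in range(4)])
y = np.repeat(np.arange(4), 30)

# Per-feature normalization followed by an RBF-kernel SVM classifier,
# mirroring the classification stage described above.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X, y)
acc = clf.score(X, y)
print(acc)
```

Bundling the scaler and the SVM in one pipeline ensures the same normalization statistics are reused at prediction time.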
Further, the modal attention is calculated through a self-attention mechanism so as to construct the weight matrix and obtain the fusion weights. The calculation expression of the modal attention is:

A = (Θ · Φ^T) / √d

where A is the modal attention, (·) denotes matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
The query matrix Θ is constructed as follows: the feature vectors of each modality are connected through a first fully connected layer, whose connection formula is y1 = w1·x + b1; the activation-function output then gives a feature matrix that forms the query matrix Θ. The query matrix Θ represents the influence of the current modality on the other modalities;
The key matrix Φ is constructed as follows: the feature vectors of each modality are connected through a second fully connected layer, whose connection formula is y2 = w2·x + b2; the activation-function output then gives a feature matrix that forms the key matrix Φ. The key matrix Φ represents the influence of the other modalities on the current modality;
The elements of each column of the modal attention A are summed to obtain the weight of modality i, with the corresponding calculation formula:

Ψ_i = Σ_k a_ki

where Ψ_i is the weight of modality i and a_ki is the element in the k-th row and i-th column of the modal attention A;
The weights of all modalities sum to 1, i.e.:

Σ_i Ψ_i = 1.
the modal attention A is trained along with the training process of the classification model to adjustParameter w in first and second fully-connected layers1、b1、w2And b2。
The invention also provides a multi-modal emotion recognition system for a companion robot, comprising:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the emotion analysis module based on facial expressions, used for extracting the emotional feature vectors of the facial expressions according to the facial expression pictures;

the emotion analysis module based on speech signals, used for extracting the emotional feature vectors of the speech according to the voice signals;

the emotion analysis module based on electroencephalogram signals, used for extracting the feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
and the emotion prediction module, used for performing emotion prediction according to the fusion features, wherein the emotion prediction performs data-fitting training on the fusion features to obtain continuous emotion intensity values over a plurality of emotion dimensions, the emotion dimensions comprising pleasure, tension, excitement and certainty.
Compared with the prior art, the invention has the following advantages:
different from the traditional single-modal emotion recognition system, the emotion recognition method and the emotion recognition system fully combine facial expressions, voice signals and electroencephalogram signals to carry out emotion analysis and discrimination, and can enhance the enlarging capability of emotion characteristics and perfect the mapping capability of emotion characterization space through multi-modal information fusion, so that the robot can show the emotion recognition capability closer to that of human beings.
Meanwhile, the multi-modal fusion mode based on the self-attention mechanism is adaptive and flexible. By combining the emotion expression advantages of different heterogeneous modalities and dynamically updating the weight parameters, the robot gains the ability to autonomously evolve and continuously adjust its emotion judgment, providing a new paradigm for human-machine emotion interaction.
In addition, the joint use of discrete emotion classification and continuous emotion-dimension prediction effectively describes the multi-modal emotional state and characterizes emotion changes more scientifically and deeply, a clear advantage for the robot in broadly understanding human emotion. The system reduces the loss of information when processing complex nonlinear multi-modal data and performs well on data with large modality spans.
Drawings
Fig. 1 is a schematic block diagram of a multi-modal emotion recognition system for a companion robot provided in an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an emotion analysis module based on facial expressions provided in an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an emotion analysis module based on speech signals provided in an embodiment of the present invention;
FIG. 4 is a schematic block diagram of an emotion analysis module based on electroencephalogram signals provided in an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a feature fusion module based on a self-attention mechanism provided in an embodiment of the present invention;
FIG. 6 is a schematic block diagram of an identification module based on discrete emotion classification according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a prediction module based on continuous emotion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present embodiment provides a multimodal emotion recognition system for a companion robot, including:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the emotion analysis module is used for extracting emotion feature vectors of the facial expressions according to the facial expression pictures;
the emotion analysis module is used for extracting emotion feature vectors of the voice according to the voice signals;
the emotion analysis module is used for extracting feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
the prediction module based on continuous emotion is used for carrying out emotion prediction according to the fusion features, the emotion prediction is used for carrying out data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty;
and the display module is used for displaying the results of the discrete emotion classification and the continuous emotion prediction in real time.
In summary, the multi-modal information is derived from the facial expressions, voice signals and electroencephalogram signals acquired by the accompanying robot through the acquisition module. Preprocessing, data feature extraction, modal output and other operations on the multi-modal information are realized through the recognition units for the different modalities in the recognition module. By introducing a self-attention mechanism in the fusion module, attention weight coefficients are calculated for the different modalities to realize feature-level modal fusion; after the emotion classification and prediction modules, the display module shows the multi-modal emotion recognition result.
Each module is described in detail below.
1. Multi-modal acquisition module
The multi-modal acquisition module comprises an emotion acquisition device for facial expressions, an emotion acquisition device for voice signals, an emotion acquisition device for electroencephalogram signals, an emotion acquisition device for physiological signals, and the like; through these multiple sensors, multi-modal data collection of the accompanied subject targeted by the accompanying robot is achieved.
2. Emotion analysis module based on facial expressions
Fig. 2 is a schematic block diagram of the emotion analysis module based on facial expressions according to the present invention, which includes four units: face detection, image preprocessing, feature extraction and modal output. The specific steps, by unit, are as follows:
(1) Firstly, target detection is realized using Adaboost combined with Haar features. The Adaboost algorithm extracts the Haar-like features of the face, including rectangular features of the input image. A Haar feature reflects the gray-level change of an image: a black rectangle and a white rectangle are combined into a feature template, and the Haar feature value of the template is calculated by subtracting the sum of the pixels in the white rectangle from the sum of the pixels in the black rectangle. Common features include edge features, line features, center-surround features and diagonal features; the outlines of the facial features differ in color from the rest of the face, so Haar features can describe the gray-level changes of the human face. To achieve fast calculation, an integral image approach is used: the integral image can quickly compute the sum of the pixels of any rectangular area in the image, so the Haar features of the image can be calculated quickly.
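The integral-image computation described above can be sketched as follows (a minimal pure-Python illustration, not the patented implementation; the two-rectangle edge feature shown is one assumed example of a Haar template). Each entry of the integral image holds the sum of all pixels above and to the left, so any rectangle sum, and hence any Haar feature value, costs only four lookups.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Pixel sum of the inclusive rectangle, via four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def haar_two_rect(ii, top, left, h, w):
    """Edge-type Haar feature: black (left half) minus white (right half)."""
    mid = left + w // 2 - 1
    black = rect_sum(ii, top, left, top + h - 1, mid)
    white = rect_sum(ii, top, mid + 1, top + h - 1, left + w - 1)
    return black - white
```

With the table precomputed, every Haar feature evaluation is constant-time regardless of rectangle size, which is what makes the Adaboost cascade fast.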
(2) The image preprocessing can recover useful information in the image and reduce irrelevant information in the image, wherein the histogram equalization is used for changing the histogram distribution of the image into approximately uniform distribution, so that the contrast of the image is enhanced.
(3) In the feature extraction unit, considering the low-complexity requirement of a system suitable for a companion robot, facial expression features are extracted with the uniform-pattern LBP algorithm. A 3 x 3 texture region is constructed and the 8 surrounding pixel values are compared with a threshold, the threshold being the central pixel value of the texture region: if a neighbor's value is greater than the threshold pixel value, that neighborhood position is set to 1; if it is less, the position is set to 0. Within the 3 x 3 region, the values generated by the 8 adjacent pixel points are assembled clockwise into an 8-bit binary number. The number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number is counted: if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, the LBP value of the center is set to P + 1 (with P = 8 neighbors, this is 9). All pixel points are traversed to obtain the LBP values of the whole image, and all the LBP values are connected in sequence into a feature vector, namely the uniform-pattern LBP feature of the whole image.
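The per-neighborhood computation above can be sketched as follows (an illustrative pure-Python version; the neighbor ordering is assumed to start at the top-left corner and run clockwise, and transitions are counted circularly, as is standard for uniform LBP):

```python
def uniform_lbp_code(patch):
    """LBP code of a 3x3 patch; non-uniform patterns map to P + 1."""
    center = patch[1][1]
    # 8 neighbors, clockwise from the top-left corner
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[y][x] > center else 0 for y, x in coords]
    # circular count of 0->1 / 1->0 transitions
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    if transitions <= 2:                       # uniform pattern
        return sum(b << (7 - i) for i, b in enumerate(bits))
    return 8 + 1                               # non-uniform: P + 1 = 9
```

A flat patch yields code 0, a bright top edge yields the uniform code 11100000 (224), and a checkerboard-like neighborhood collapses to the single non-uniform bin 9 — which is exactly what keeps the uniform-pattern feature vector compact.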
(4) After the emotional feature vector of the facial expression is obtained, the modal output unit pre-stores and outputs the modality.
3. Emotion analysis module based on voice signal
Fig. 3 is a schematic block diagram of the emotion analysis module based on speech signals, which mainly includes a data preprocessing unit, an MFCC feature extraction unit, a spectrogram feature extraction unit and a modal output unit. The basic steps of these units are as follows:
(1) In order to make the obtained original signal smoother, the signal is windowed: a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis.
(2) In the MFCC feature extraction stage, the speech signal is converted from the Hertz scale to the Mel scale, and cepstral analysis is then carried out. The idea is that during speech the articulatory system (lips and vocal cords) suppresses the high-frequency part of the signal; to compensate for this and to highlight the high-frequency formants, a pre-emphasis high-pass filter is added. By applying a coefficient in the frequency domain that is positively correlated with frequency, the amplitude of the high frequencies is boosted.
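The windowing and pre-emphasis steps can be sketched as follows (a minimal pure-Python illustration; the pre-emphasis coefficient 0.97 is a conventional choice, not a value stated in this document):

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming_window(n_samples):
    """Hamming window coefficients for one frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_and_window(signal, frame_len):
    """Split into non-overlapping frames and apply the Hamming window."""
    win = hamming_window(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [[s * w for s, w in zip(f, win)] for f in frames]
```

Each windowed frame would then be Fourier-transformed for the Mel filter bank and cepstral analysis; overlapping frames are also common in practice, but non-overlapping ones keep the sketch short.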
(3) In the spectrogram feature extraction unit, the Fourier-transformed spectrogram is fed into a convolutional neural network to extract spectrogram features; the network comprises an input layer, convolutional layers, pooling layers and a fully-connected layer. Specifically, the CNN structure consists of an input layer, 2 convolutional layers, 2 pooling layers and a fully-connected layer. The input image size is 128 x 128 pixels. The first convolutional layer consists of 64 convolution kernels of 5 x 5 pixels, followed by a ReLU nonlinear activation function and a 2 x 2 pooling layer; the purpose of the pooling layer is to reduce computational complexity and extract the main features. The second convolutional layer consists of 128 convolution kernels of 5 x 5, again followed by a ReLU activation function and a 2 x 2 pooling layer. After the second pooling layer, 128 feature maps of size 32 x 32 are obtained; finally, a fully-connected layer of 512 neurons is attached, yielding a 512-dimensional feature vector.
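The layer-by-layer sizes stated above can be checked with a short shape trace (pure Python; 'same' convolution padding is assumed, since the stated sizes 128 → 64 → 32 halve only at the pooling layers):

```python
def conv_same(size, channels_out):
    """5x5 convolution with 'same' padding keeps the spatial size."""
    return size, channels_out

def pool2(size, channels):
    """2x2 pooling halves the spatial size."""
    return size // 2, channels

size, ch = 128, 1                 # 128 x 128 input spectrogram
size, ch = conv_same(size, 64)    # conv1: 64 kernels of 5 x 5 -> 128 x 128 x 64
size, ch = pool2(size, ch)        # pool1 -> 64 x 64 x 64
size, ch = conv_same(size, 128)   # conv2: 128 kernels of 5 x 5 -> 64 x 64 x 128
size, ch = pool2(size, ch)        # pool2 -> 32 x 32 x 128
flat = size * size * ch           # flattened input to the 512-unit FC layer
```

The trace confirms that the 512-neuron fully-connected layer receives a 32 x 32 x 128 = 131072-dimensional flattened input.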
(4) After the emotional feature vector of the speech is obtained, the modal output unit pre-stores and outputs the modality.
4. Emotion analysis module based on electroencephalogram signals
FIG. 4 is a schematic block diagram of the emotion analysis module based on electroencephalogram signals, which mainly comprises a preprocessing and denoising unit, a fractal dimension feature extraction unit, a multi-scale entropy feature extraction unit and a modal output unit. The specific steps are as follows:
(1) Firstly, the electroencephalogram signal is preprocessed and denoised. The EEG signal of the FP1 channel of the accompanied person is taken from the signal acquisition module at a sampling frequency of 128 Hz; the acquisition time is 63 s, of which the first 3 s of baseline are removed, leaving 60 s of signal, so the total number of sampling points is 7680. The system selects the db5 wavelet to perform 5-layer wavelet decomposition, then uses a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally reconstructs the electroencephalogram signal. After preprocessing, the EEG signal is smoother and suitable for further processing.
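The soft-threshold rule used in the denoising step can be sketched as follows (coefficients with magnitude below the threshold are zeroed as noise; the rest are shrunk toward zero by the threshold):

```python
def soft_threshold(coeffs, threshold):
    """Soft-threshold wavelet coefficients: zero small ones, shrink the rest."""
    out = []
    for c in coeffs:
        if abs(c) <= threshold:
            out.append(0.0)              # treated as noise
        elif c > 0:
            out.append(c - threshold)    # shrink positive coefficients
        else:
            out.append(c + threshold)    # shrink negative coefficients
    return out
```

In the full pipeline this would be applied to the detail coefficients of the 5-level db5 decomposition (e.g. via a wavelet library such as PyWavelets) before signal reconstruction; unlike hard thresholding, the shrinkage avoids discontinuities in the retained coefficients.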
(2) In the fractal dimension (FD) feature extraction part, the length of the electroencephalogram signal is fixed at N. The signal is then uniformly subsampled with different values of k, constructing k new sequences for each scale by taking every k-th sample from different starting offsets; the lengths of all the new sequences are calculated, a slope is obtained by least-squares fitting of the curve lengths against the scale, and the negative of the slope is the required initial FD feature. Then, window processing is applied to the wavelet-threshold-denoised electroencephalogram signal: the data is segmented with a window of 256 points into 30 segments, and a fractal dimension feature is extracted from each segment, yielding a 30-dimensional feature vector.
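The subsample-measure-and-fit procedure described above matches the Higuchi fractal dimension; a compact pure-Python sketch (illustrative only, with the maximum scale kmax as an assumed free parameter) is:

```python
import math

def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension: slope of log curve length vs. log(1/k)."""
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):                       # k subsampled sequences
            n_steps = (n - m - 1) // k
            if n_steps < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, n_steps + 1))
            # normalized curve length for this starting offset
            lengths.append(dist * (n - 1) / (n_steps * k * k))
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(sum(lengths) / len(lengths)))
    # least-squares slope of log L(k) against log(1/k) is the FD estimate
    mk = sum(log_inv_k) / len(log_inv_k)
    ml = sum(log_len) / len(log_len)
    num = sum((a - mk) * (b - ml) for a, b in zip(log_inv_k, log_len))
    den = sum((a - mk) ** 2 for a in log_inv_k)
    return num / den
```

A smooth, nearly linear signal gives an FD close to 1, while highly irregular signals approach 2 — this spread is what makes the FD useful as an EEG complexity feature.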
(3) In the multi-scale entropy feature extraction unit, multi-scale entropy analyzes the complexity of the time series at different time scales. To calculate the sample entropy at different time scales, the raw signal first needs to be coarse-grained; coarse-graining means segmenting the original signal using non-overlapping windows of length i and averaging within each window. The sample entropy values obtained at different scales differ, and so does the dimensionality of the resulting multi-scale entropy features. The system selects the electroencephalogram signal of the FP1 channel of the accompanied person and extracts the EEG data obtained from all experiments; samples with a self-reported valence score greater than 6 are labeled as happy emotion, and samples with a score less than 4 as sad emotion. The multi-scale entropy of the electroencephalogram signal is computed, the average multi-scale entropy values of the happy and sad emotions in the experiment are obtained, and the multi-scale entropy at the first several scales is selected as the feature vector of the electroencephalogram signal, yielding a 15-dimensional multi-scale entropy feature in total.
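Coarse-graining and sample entropy — the two ingredients of multi-scale entropy — can be sketched as follows (an illustrative pure-Python version; the tolerance r is conventionally taken as a fraction of the signal's standard deviation, an assumption here rather than a value from this document):

```python
import math

def coarse_grain(signal, scale):
    """Average non-overlapping windows of length `scale`."""
    n = len(signal) // scale
    return [sum(signal[i * scale:(i + 1) * scale]) / scale for i in range(n)]

def sample_entropy(signal, m=2, r=0.2):
    """SampEn = -ln(A/B): matches of length m+1 over matches of length m."""
    def count_matches(length):
        templates = [signal[i:i + length]
                     for i in range(len(signal) - length + 1)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if max(abs(a - b)
                       for a, b in zip(templates[i], templates[j])) <= r:
                    count += 1
        return count
    b = count_matches(m)
    a = count_matches(m + 1)
    return -math.log(a / b) if a > 0 and b > 0 else float("inf")

def multiscale_entropy(signal, max_scale=5, m=2, r=0.2):
    """Sample entropy of the coarse-grained signal at scales 1..max_scale."""
    return [sample_entropy(coarse_grain(signal, s), m, r)
            for s in range(1, max_scale + 1)]
```

A highly regular signal (e.g. a strict alternation) yields a sample entropy near zero, while irregular signals score higher; the vector over the first scales is what the module would use as the EEG feature.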
(4) After the fractal dimension features and the multi-scale entropy features are obtained, the modal output unit pre-stores and outputs the modality.
5. Feature fusion module based on self-attention mechanism
Fig. 5 is a schematic block diagram of the feature fusion module based on a self-attention mechanism provided in the present invention. The main steps of the attention calculation mechanism are information input, calculation of the attention distribution, and calculation of the weighted average of the input information. To fuse the extracted multi-modal features (facial expressions, speech, electroencephalogram signals, etc.) effectively, a weight matrix is first initialized to represent the weight value of each modal feature; in the feature fusion process, the weight value of each modality is multiplied by the corresponding feature vector, and the weighted vectors are then concatenated. Throughout the training of the system model, the weight matrix is trained along with the model, and its values are continuously adjusted according to the training. Compared with manually fixed weight values, this gives a better effect.
The feature fusion module involves interactions between two types of inputs: αii is the self-attention interaction of modality i, and αij is the attention interaction between modalities, reflecting the effect of modality i on modality j. The modal attention is calculated as follows:
A = softmax(Θ·Φ^T / √d)
in the formula, A is the modal attention, (·) is matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
the construction process of the query matrix Θ is as follows: the feature vectors of each modality are connected through a first fully-connected layer, whose connection formula is y1 = w1·x + b1; a feature matrix is finally obtained through the activation function output, forming the query matrix Θ;
the construction process of the key matrix Φ is as follows: the feature vectors of each modality are connected through a second fully-connected layer, whose connection formula is y2 = w2·x + b2; a feature matrix is finally obtained through the activation function output, forming the key matrix Φ;
the weight of modality i is obtained by adding the elements of each column of the modal attention A; the corresponding calculation formula is:
Ψi = ∑k aki
in the formula, Ψi is the weight of modality i, and aki is the element in the k-th row and i-th column of the modal attention A;
the sum of the weights of all the modalities is 1, i.e.:
∑i Ψi = 1.
the modal attention A is trained along with the training process of the SVM classification model, and the parameter w in the first full-connection layer and the parameter w in the second full-connection layer are adjusted1、b1、w2And b2。
6. Identification module based on discrete emotion classification
Fig. 6 is a schematic block diagram of the recognition module based on discrete emotion classification according to the present invention. After the feature vectors of the multiple modalities are calculated, data normalization usually needs to be performed on the feature vectors. When the data are used for SVM classification, normalized data shorten the training time and improve the test accuracy compared with the raw data; normalization makes the data more compact, so that the optimal classification hyperplane can be obtained. The system uses svm-scale to scale the data to [0, 1] or [-1, 1]; the purpose of scaling is to prevent any one feature from being too large or too small, to speed up computation, and to facilitate model training. The system selects the RBF kernel as the kernel function of the SVM classification algorithm; the RBF kernel corresponds to a nonlinear mapping, can handle problems that are not linearly separable, and is suitable for processing multidimensional vectors. After model training, the recognition module outputs one of four emotion labels: happy, sad, calm and disgust.
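The svm-scale step amounts to per-feature min-max scaling; a minimal sketch of scaling each feature column to [0, 1] is:

```python
def fit_minmax(rows):
    """Per-feature minimum and maximum over the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale_minmax(row, lo, hi):
    """Scale one feature vector to [0, 1]; constant features map to 0."""
    return [(v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(row, lo, hi)]

# toy fused feature vectors with very different feature ranges
train = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
lo, hi = fit_minmax(train)
scaled = [scale_minmax(r, lo, hi) for r in train]
```

As with svm-scale's saved scaling parameters, the same lo/hi learned on the training set must be reused to scale test data, otherwise the SVM sees inconsistently scaled inputs.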
7. Prediction module based on continuous emotion
Fig. 7 is a schematic block diagram of the prediction module based on continuous emotion. The continuous emotion dimensions are defined as pleasure, tension, excitement and certainty, each quantized to a standard value range of 0 to 10. Data fitting training is performed on the multi-modal emotion features using a multivariate nonlinear regression method, and the module finally outputs the emotion intensity values corresponding to the four dimensions.
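A minimal sketch of the data-fitting idea is given below (a plain linear least-squares fit per emotion dimension via gradient descent, with outputs clipped to the stated 0-10 range; the actual module uses multivariate nonlinear regression, and the training pairs here are hypothetical toy values):

```python
def fit_linear(xs, ys, lr=0.05, epochs=3000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_intensity(x, w, b):
    """Predicted emotion intensity, clipped to the 0-10 scale."""
    return min(10.0, max(0.0, w * x + b))

# toy scalar fused feature vs. 'pleasure' intensity: exactly y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear(xs, ys)
```

One such regressor (in general, over the full fused feature vector and with nonlinear terms) would be fitted per dimension, so the module emits four clipped intensity values.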
The multi-modal emotion recognition system for the accompanying robot shown in this embodiment solves the problems of emotional interaction and recognition for the accompanying robot. By fully collecting multi-modal information such as facial expressions, voice signals and electroencephalogram signals and combining it with the different recognition modules, preprocessing and feature extraction of heterogeneous data are realized, and multi-modal information fusion at the feature level is then realized based on the self-attention mechanism. Unlike traditional emotion recognition systems carried by robots, the invention fully combines discrete emotion classification with continuous-dimension emotion prediction, completely depicts the emotional feature space of the subject, can comprehensively obtain recognized emotional feedback from the system, predicts the subject's emotion change trend, and greatly improves the accuracy of emotion recognition.
Example 2
The embodiment provides a multi-modal emotion recognition method for a companion robot, which corresponds to the processing process of each module in the multi-modal emotion recognition system in embodiment 1, and specifically comprises the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction is used for performing data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty.
The extracting of the emotional feature vector of the facial expression specifically includes:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
The extraction process of the uniform pattern LBP algorithm comprises the following steps:
constructing a texture region with a size of 3 x 3, where the threshold is the central pixel value of the texture region; comparing the 8 surrounding pixel values with the threshold, and if a value is greater than the threshold pixel value, setting the neighborhood position to 1; if the value is less than the threshold pixel value, setting the neighborhood position to 0; in the 3 x 3 texture region, assembling the values generated by the 8 adjacent pixel points clockwise into an 8-bit binary number, and counting the number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number; if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, setting the LBP value of the center to P + 1 (with P = 8, this is 9); traversing all the pixel points to obtain the LBP values of the whole image, and connecting all the LBP values in sequence into a feature vector, namely the emotional feature vector of the facial expression.
The extracting of the emotion feature vector of the voice specifically includes:
firstly, windowing is carried out on the speech signal, a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis. Then, a pre-emphasis high-pass filter is designed to compensate for the suppression of high frequencies by the lips and vocal cords during speech production, and MFCC feature extraction is carried out; and the Fourier-transformed spectrogram is input into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
The extracting of the feature vector of the electroencephalogram signal specifically comprises:
firstly, preprocessing and denoising the electroencephalogram signal, then respectively extracting fractal dimension features and multi-scale entropy features, and constructing the feature vector of the electroencephalogram signal.
The pre-processing denoising comprises:
collecting the electroencephalogram signal of the subject at a fixed sampling frequency, then selecting the db5 wavelet for multi-layer wavelet decomposition, then using a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally completing the reconstruction of the electroencephalogram signal.
The extraction process of the fractal dimension features comprises the following steps:
after subsampling the original sequence to obtain new sequence signals, window processing is performed on the preprocessed and denoised electroencephalogram signal; the data is then divided into several segments using the window, and a fractal dimension feature is extracted from each segment of data.
The extraction process of the multi-scale entropy features comprises the following steps:
calculating the multi-scale entropy of the electroencephalogram signal, obtaining the average multi-scale entropy values of the happy emotion and the sad emotion of the tested subject, and then selecting the multi-scale entropy at the first one or more scales as the feature vector of the electroencephalogram signal.
Before the classification recognition, the method further comprises: respectively performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the voice and the feature vectors of the electroencephalogram signals to obtain the fusion features, and performing classification recognition; the classification model adopts an SVM classification model, and the kernel function of the SVM classification model is an RBF kernel function; the emotion labels include happy, sad, calm and disgust.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (10)
1. A multi-modal emotion recognition method for a companion robot, characterized by comprising the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction is used for performing data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty.
2. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the extracting emotion feature vectors of facial expressions specifically comprises:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
3. The multi-modal emotion recognition method for a companion robot as claimed in claim 2, wherein the extraction process of the uniform pattern LBP algorithm comprises:
constructing a texture region with a size of 3 x 3, where the threshold is the central pixel value of the texture region; comparing the 8 surrounding pixel values with the threshold, and if a value is greater than the threshold pixel value, setting the position where that pixel is located to 1; if the value is less than the threshold pixel value, setting the position where that pixel is located to 0; in the 3 x 3 texture region, assembling the values generated by the 8 adjacent pixel points clockwise into an 8-bit binary number, and counting the number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number; if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, setting the LBP value of the center to P + 1 (with P = 8, this is 9); traversing all the pixel points to obtain the LBP values of the whole image, and connecting all the LBP values in sequence into a feature vector, namely the emotional feature vector of the facial expression.
4. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the extracting emotion feature vectors of speech specifically comprises:
firstly, windowing is carried out on the speech signal, a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis; then, a pre-emphasis high-pass filter is designed to compensate for the suppression of high frequencies by the lips and vocal cords during speech production, and MFCC feature extraction is carried out; and the Fourier-transformed spectrogram is input into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
5. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein said extracting feature vectors of electroencephalogram signals specifically comprises:
firstly, preprocessing and denoising the electroencephalogram signal, then respectively extracting fractal dimension features and multi-scale entropy features, and constructing the feature vector of the electroencephalogram signal.
6. The multi-modal emotion recognition method for a companion robot as recited in claim 5, wherein the pre-processing denoising comprises:
collecting the electroencephalogram signal of the subject at a fixed sampling frequency, then selecting the db5 wavelet for multi-layer wavelet decomposition, then using a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally completing the reconstruction of the electroencephalogram signal.
7. The multi-modal emotion recognition method for a companion robot as claimed in claim 6, wherein the extraction process of the fractal dimension features comprises:
uniformly sampling the original sequence to obtain K sequences, calculating the variation between adjacent elements in the K sequences to construct new sequences, fitting the new sequences to obtain a slope, and taking the negative of the slope as the initial FD feature; performing window processing on the preprocessed and denoised electroencephalogram signal, dividing the data into a plurality of segments using the window, and respectively extracting a fractal dimension feature from each segment of data;
the extraction process of the multi-scale entropy features comprises the following steps:
calculating the multi-scale entropy of the electroencephalogram signal, obtaining the average multi-scale entropy values of the happy emotion and the sad emotion of the tested subject, and then selecting the multi-scale entropy at the first one or more scales as the feature vector of the electroencephalogram signal.
8. The multi-modal emotion recognition method for a companion robot as recited in claim 1, further comprising, prior to said classification recognition: respectively performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the voice and the feature vectors of the electroencephalogram signals to obtain the fusion features, and performing classification recognition; the classification model adopts an SVM classification model, and the kernel function of the SVM classification model is an RBF kernel function; the emotion labels include happy, sad, calm and disgust.
9. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the weight matrix is constructed by calculating modal attention through a self-attention mechanism to obtain fusion weights, and the calculation expression of the modal attention is as follows:
A = (Θ·Φ^T)/√d
in the formula, A is the modal attention, (·) is matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
the construction process of the query matrix theta is as follows: connecting the feature vectors of each mode through a first full connection layer, wherein the connection formula of the first full connection layer is y1=w1x+b1Finally, a characteristic quantity matrix is obtained through activating function output, and a query matrix theta is formed; the query matrix Θ is used to represent the influence of the current modality on other modalities;
the key matrix phi is constructed by the following steps: connecting the feature vectors of each mode through a second fully-connected layer, wherein the connection formula of the second fully-connected layer is y2=w2x+b2Finally, a characteristic quantity matrix is obtained through activating function output to form a key matrix phi; the key matrix phi is used for representing the influence of other modalities on the current modality;
adding the elements of each column of the modal attention A to obtain the weight of modality i, wherein the corresponding calculation formula is as follows:
Ψi = ∑k aki
in the formula, Ψi is the weight of modality i, and aki is the element in the k-th row and i-th column of the modal attention A;
the sum of the weights of all the modalities is 1, i.e.:
∑i Ψi = 1
the modal attention A is trained together with the training process of the classification model to adjust the parameters w1, b1, w2 and b2 in the first fully-connected layer and the second fully-connected layer.
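The attention-based weighting of claim 9 can be sketched in NumPy as follows. The softmax on the attention scores and the final renormalization are assumptions added so that the weights are positive and satisfy the claim's constraint ∑i Ψi = 1; the patent itself only states the query/key construction, the scaled product, and the column sums:

```python
import numpy as np

def modal_attention_weights(feats, w1, b1, w2, b2):
    """feats: (n_modalities, d) matrix of stacked per-modality feature
    vectors. Returns one fusion weight per modality, summing to 1."""
    d = feats.shape[1]
    theta = feats @ w1 + b1              # query matrix (first FC layer)
    phi = feats @ w2 + b2                # key matrix (second FC layer)
    scores = theta @ phi.T / np.sqrt(d)  # scaled attention scores
    # row-wise softmax (assumed) so entries are positive
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)
    psi = A.sum(axis=0)                  # Ψ_i = Σ_k a_ki (column sums)
    return psi / psi.sum()               # enforce Σ_i Ψ_i = 1
```

In the patented method w1, b1, w2, b2 are learned jointly with the classifier; here they would simply be supplied as arrays.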
10. A system adopting the multi-modal emotion recognition method for a companion robot as set forth in any one of claims 1 to 9, comprising:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the facial expression analysis module is used for extracting emotional feature vectors of the facial expressions according to the facial expression pictures;
the voice emotion analysis module is used for extracting emotional feature vectors of the voice according to the voice signals;
the electroencephalogram analysis module is used for extracting feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
and the emotion prediction module is used for carrying out emotion prediction according to the fusion characteristics, and the emotion prediction is used for carrying out data fitting training on the fusion characteristics to obtain continuous emotion intensity values which are divided into a plurality of emotion dimensions, wherein the emotion dimensions comprise pleasure, tension, excitement and certainty.
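The module decomposition of claim 10 amounts to a thin orchestration layer over the per-modality extractors, the fusion step, and the two heads (discrete classification and continuous intensity prediction). A minimal sketch, where every field name and callable is a hypothetical stand-in rather than anything named in the patent:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EmotionRecognitionSystem:
    # each field stands in for one module of claim 10
    extract_face: Callable       # facial expression picture -> feature vector
    extract_voice: Callable      # voice signal -> feature vector
    extract_eeg: Callable        # EEG signal -> feature vector
    fuse: Callable               # (face, voice, eeg) -> fusion features
    classify: Callable           # fusion features -> discrete emotion label
    predict_intensity: Callable  # fusion features -> continuous dimensions

    def recognize(self, picture, voice, eeg):
        feats = self.fuse(self.extract_face(picture),
                          self.extract_voice(voice),
                          self.extract_eeg(eeg))
        return self.classify(feats), self.predict_intensity(feats)
```

The classifier and the intensity regressor share the same fusion features, mirroring how claim 10 routes both the discrete recognition module and the emotion prediction module off the attention-weighted fusion output.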
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111079583.8A CN113947127A (en) | 2021-09-15 | 2021-09-15 | Multi-mode emotion recognition method and system for accompanying robot |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113947127A true CN113947127A (en) | 2022-01-18 |
Family
ID=79328488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111079583.8A Pending CN113947127A (en) | 2021-09-15 | 2021-09-15 | Multi-mode emotion recognition method and system for accompanying robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113947127A (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114209323B (en) * | 2022-01-21 | 2024-05-10 | 中国科学院计算技术研究所 | Method for identifying emotion and emotion identification model based on electroencephalogram data |
CN114209323A (en) * | 2022-01-21 | 2022-03-22 | 中国科学院计算技术研究所 | Method for recognizing emotion and emotion recognition model based on electroencephalogram data |
CN114565964A (en) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | Emotion recognition model generation method, recognition method, device, medium and equipment |
CN115035438A (en) * | 2022-05-27 | 2022-09-09 | 中国科学院半导体研究所 | Emotion analysis method and device and electronic equipment |
CN115064246A (en) * | 2022-08-18 | 2022-09-16 | 山东第一医科大学附属省立医院(山东省立医院) | Depression evaluation system and equipment based on multi-mode information fusion |
CN115064246B (en) * | 2022-08-18 | 2022-12-20 | 山东第一医科大学附属省立医院(山东省立医院) | Depression evaluation system and equipment based on multi-mode information fusion |
CN115410561A (en) * | 2022-11-02 | 2022-11-29 | 中汽数据有限公司 | Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction |
CN115410561B (en) * | 2022-11-02 | 2023-02-17 | 中汽数据有限公司 | Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction |
CN116543445A (en) * | 2023-06-29 | 2023-08-04 | 新励成教育科技股份有限公司 | Method, system, equipment and storage medium for analyzing facial expression of speaker |
CN116543445B (en) * | 2023-06-29 | 2023-09-26 | 新励成教育科技股份有限公司 | Method, system, equipment and storage medium for analyzing facial expression of speaker |
CN116561533A (en) * | 2023-07-05 | 2023-08-08 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
CN116561533B (en) * | 2023-07-05 | 2023-09-29 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
CN117056863A (en) * | 2023-10-10 | 2023-11-14 | 湖南承希科技有限公司 | Big data processing method based on multi-mode data fusion |
CN117056863B (en) * | 2023-10-10 | 2023-12-26 | 湖南承希科技有限公司 | Big data processing method based on multi-mode data fusion |
CN117494013A (en) * | 2023-12-29 | 2024-02-02 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
CN117494013B (en) * | 2023-12-29 | 2024-04-16 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
CN117520826A (en) * | 2024-01-03 | 2024-02-06 | 武汉纺织大学 | Multi-mode emotion recognition method and system based on wearable equipment |
CN117520826B (en) * | 2024-01-03 | 2024-04-05 | 武汉纺织大学 | Multi-mode emotion recognition method and system based on wearable equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113947127A (en) | Multi-mode emotion recognition method and system for accompanying robot | |
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108899050B (en) | Voice signal analysis subsystem based on multi-modal emotion recognition system | |
CN108682431B (en) | Voice emotion recognition method in PAD three-dimensional emotion space | |
CN114176607B (en) | Electroencephalogram signal classification method based on vision transducer | |
CN104809450B (en) | Wrist vena identification system based on online extreme learning machine | |
Jayanthi et al. | An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach | |
Mini et al. | EEG based direct speech BCI system using a fusion of SMRT and MFCC/LPCC features with ANN classifier | |
Kumar et al. | Artificial Emotional Intelligence: Conventional and deep learning approach | |
He et al. | What catches the eye? Visualizing and understanding deep saliency models | |
CN112418166A (en) | Emotion distribution learning method based on multi-mode information | |
Zhang et al. | DeepVANet: a deep end-to-end network for multi-modal emotion recognition | |
CN113951883B (en) | Gender difference detection method based on electroencephalogram signal emotion recognition | |
CN117058597B (en) | Dimension emotion recognition method, system, equipment and medium based on audio and video | |
Priatama et al. | Hand gesture recognition using discrete wavelet transform and convolutional neural network | |
Morade et al. | Comparison of classifiers for lip reading with CUAVE and TULIPS database | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
HR et al. | A novel hybrid biometric software application for facial recognition considering uncontrollable environmental conditions | |
Kächele et al. | Fusion mappings for multimodal affect recognition | |
Dixit et al. | Multi-feature based automatic facial expression recognition using deep convolutional neural network | |
CN116343287A (en) | Facial expression recognition and model training method, device, equipment and storage medium | |
Moran | Classifying emotion using convolutional neural networks | |
Sushma et al. | Emotion analysis using signal and image processing approach by implementing deep neural network | |
Kim et al. | A study on user recognition using 2D ECG image based on ensemble networks for intelligent vehicles | |
Gaus et al. | Automatic affective dimension recognition from naturalistic facial expressions based on wavelet filtering and PLS regression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||