CN113947127A - Multi-mode emotion recognition method and system for accompanying robot - Google Patents
- Publication number
- CN113947127A (application number CN202111079583.8A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- modal
- feature vectors
- extracting
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/16—Devices for psychotechnics; Testing reaction times ; Devices for evaluating the psychological state
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/24—Detecting, measuring or recording bioelectric or biomagnetic signals of the body or parts thereof
- A61B5/316—Modalities, i.e. specific diagnostic methods
- A61B5/369—Electroencephalography [EEG]
- A61B5/372—Analysis of electroencephalograms
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7203—Signal processing specially adapted for physiological signals or for diagnostic purposes for noise prevention, reduction or removal
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61B—DIAGNOSIS; SURGERY; IDENTIFICATION
- A61B5/00—Measuring for diagnostic purposes; Identification of persons
- A61B5/72—Signal processing specially adapted for physiological signals or for diagnostic purposes
- A61B5/7235—Details of waveform analysis
- A61B5/7264—Classification of physiological signals or data, e.g. using neural networks, statistical classifiers, expert systems or fuzzy systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention relates to a multi-modal emotion recognition method and system for a companion robot. The method comprises: separately collecting facial expression pictures, voice signals and electroencephalogram signals; extracting emotional feature vectors of the facial expressions, emotional feature vectors of the voice and feature vectors of the electroencephalogram signals; acquiring a weight matrix and multiplying each feature vector by the weight matrix to obtain fusion features; classifying four common emotions, namely happiness, sadness, calmness and disgust, through a support vector machine; and predicting continuous emotion scores by multivariate nonlinear regression over four emotion dimensions: pleasure, tension, excitement and certainty. Compared with the prior art, the method achieves an emotion recognition capability closer to that of human beings through information fusion; dynamically updated weight parameters give the robot the ability to autonomously evolve and continuously adjust its emotion judgment; and the combination of discrete and continuous emotion recognition describes emotion changes more scientifically and deeply.
Description
Technical Field
The invention relates to the technical field of emotion recognition, and in particular to a multi-modal emotion recognition method and system for a companion robot.
Background
Emotion interaction has received great attention in research on natural human-computer interaction. Emotion recognition is the key to human-computer emotion interaction, and its goal is to enable a machine to perceive the emotional state of human beings and improve the humanization level of the machine. Multi-modal emotion recognition technology has broad application prospects and research value in the field of companion robots. Using the various sensors carried by the robot, multi-modal signals containing latent emotional characteristics, such as facial expressions, behaviors, voice and physiological signals, are acquired; features are extracted and fused with deep-learning methods; and human emotion is analyzed and predicted, giving the companion robot stronger emotion recognition and emotion understanding capabilities.
At present, emotion recognition devices suitable for mounting on a companion robot generally analyze, through chips, video systems, audio systems and other subsystems, the changes in physiological characteristics, postures, gestures, intonation and other signals caused by human emotion changes, so as to understand human emotion and present clear and timely responses. Emotion recognition based on facial expressions is usually performed on two-dimensional images: geometric methods based on facial organs and salient facial positions, pixel methods based on facial texture features, and hybrid methods combining the two. Speech-based emotion recognition usually extracts prosodic information and voice-quality features from the speech signal, including Mel-frequency cepstral coefficients (MFCC) and Teager energy operators, and classifies emotions with a support vector machine or a long short-term memory network. For physiological signals, emotion understanding is carried out with traditional machine learning and spiking neural networks, using the frequency bands most relevant to emotion and the temporal stability characteristics of brain regions and the electroencephalogram. Part of current research integrates features from various behavioral and physiological manifestations into a single emotion recognition framework: for example, the mental state corresponding to behavior is inferred from a combination of head movements and facial expressions, which in turn indicates the person's emotional expression. Discrete emotion classification has also been realized by combining expressions and speech signals through a subspace of expression shared across modalities.
Most conventional companion robots lack an emotion recognition function, and robots carrying a specific sensor can only realize simple, single-modality emotion recognition. Techniques based solely on facial expressions or speech signals ignore the complementarity and mutual reinforcement among emotion expressions acquired in different modalities; when the corresponding emotion information is disturbed or insufficiently acquired, recognition efficiency is low and the application requirements of emotion interaction cannot be met. Meanwhile, most existing technical means only identify the external emotional expressions of human beings and overlook the importance of monitoring and detecting physiological signals for emotion recognition. Electroencephalogram and neural signals can accurately, objectively and in real time reflect abnormal emotional and psychological state changes, helping the companion robot perform emotion analysis for different accompanied people and achieve accurate emotional comfort.
In recent years, the emotion recognition modules carried by companion robots have only been able to perform simple preprocessing on the acquired modal information, so data loss and errors often occur. On this basis, multi-modal data fusion is generally performed at the data-set level, i.e., complex and tedious data fusion is carried out without guaranteeing data integrity, which greatly wastes data resources. In addition, most traditional methods adopt a discrete emotion recognition strategy and do not fully consider the continuity and heterogeneity of human emotion changes, so emotion recognition performance is usually poor.
In summary, developing a method based on multi-modal feature acquisition and expression that uses multi-modal emotion characterization data from facial expressions, speech and physiological signals, fully expresses the discriminative power of heterogeneous features, overcomes the difficulties of existing single-modality research, and constructs a multi-modal emotion recognition system suitable for a companion robot has become a problem to be solved by those skilled in this research field.
Disclosure of Invention
The present invention aims to overcome the above-mentioned drawbacks of the prior art by providing a multi-modal emotion recognition method and system for a companion robot that fully express the discriminative power of heterogeneous features using multi-modal emotion characterization data from facial expressions, speech and physiological signals.
The purpose of the invention can be realized by the following technical scheme:
a multi-mode emotion recognition method for a companion robot comprises the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction performs data-fitting training on the fusion features to obtain continuous emotion intensity values over a plurality of emotion dimensions, the emotion dimensions comprising pleasure, tension, excitement and certainty.
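As a sketch of the continuous-emotion prediction step above, the snippet below fits a multivariate nonlinear regression from fused feature vectors to the four dimension scores. The concrete form (a degree-2 polynomial expansion followed by a linear fit) is an assumption, since the patent does not fix one, and all data are random placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical fused features (n_samples x n_features) and continuous
# labels for the four dimensions: pleasure, tension, excitement, certainty.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # stand-in fused multi-modal vectors
Y = rng.uniform(0, 1, size=(200, 4))    # placeholder dimension scores

# Multivariate nonlinear regression: polynomial expansion followed by a
# linear fit, with one continuous output per emotion dimension.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, Y)
scores = model.predict(X[:5])           # continuous emotion intensity values
print(scores.shape)                     # (5, 4)
```

Any regressor that maps the fused vector to four real-valued outputs would fit this step; the pipeline form just keeps the nonlinearity and the fit separable.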
Further, the extracting the emotional feature vector of the facial expression specifically includes:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
Further, the extraction process of the homogeneous pattern LBP algorithm includes:
A 3 × 3 texture region is constructed, with the central pixel value of the region as the threshold. The 8 surrounding pixel values are compared with the threshold: if a pixel value is greater than the threshold, its position is set to 1; if it is less than the threshold, its position is set to 0. Within the 3 × 3 texture region, the values of the 8 neighboring pixels are concatenated clockwise into an 8-bit binary number, and the number of jumps from 0 to 1 or from 1 to 0 in this binary number is counted. If there are at most two jumps, the decimal value of the binary number is the LBP value of the center of the 3 × 3 neighborhood; if there are more than two jumps, the LBP value of the center is set to P + 1 = 9 (with P = 8). Traversing all pixels yields the LBP value of the whole image, and connecting all LBP values in sequence gives the emotional feature vector of the facial expression.
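The uniform-pattern LBP rule above can be sketched as follows. The clockwise neighbour ordering (starting from the top-left) and the circular transition count are assumptions, since the patent does not fix a starting neighbour:

```python
import numpy as np

def uniform_lbp(img):
    """Uniform-pattern LBP as described: 8 clockwise neighbours are
    thresholded against the centre pixel; patterns with at most two
    0/1 transitions keep their decimal value, others map to P + 1 = 9."""
    # clockwise offsets starting at the top-left neighbour (an assumption)
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int64)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            c = img[i, j]
            bits = [1 if img[i + di, j + dj] > c else 0 for di, dj in offs]
            # circular 0->1 / 1->0 transition count (uniform-LBP convention)
            jumps = sum(bits[k] != bits[(k + 1) % 8] for k in range(8))
            if jumps <= 2:                      # uniform pattern
                out[i - 1, j - 1] = int(''.join(map(str, bits)), 2)
            else:                               # non-uniform -> P + 1
                out[i - 1, j - 1] = 9
    return out

img = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(uniform_lbp(img))                         # [[30]]
```

For the 3 × 3 example the clockwise bits around the centre 5 are 00011110 (two transitions, hence uniform), whose decimal value 30 becomes the centre's LBP code.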
Further, the extracting the emotion feature vector of the speech specifically includes:
The speech signal is first windowed and smoothed with a Hamming window, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis. A high-pass filter is then designed to eliminate noise from vocal-cord excitation, and MFCC feature extraction is performed. Finally, the Fourier-transformed spectrogram is fed into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
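A minimal sketch of this speech front-end (pre-emphasis high-pass filter, Hamming-windowed framing, FFT into a magnitude spectrogram). The parameters — 16 kHz sampling, 25 ms frames, 10 ms hop, 0.97 pre-emphasis — are conventional assumptions, not values from the patent:

```python
import numpy as np

sr, frame_len, hop = 16000, 400, 160                     # assumed parameters
signal = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)    # 1 s test tone

# High-pass pre-emphasis filter to attenuate low-frequency glottal energy
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# Framing + Hamming window to smooth frame edges before the FFT
n_frames = 1 + (len(emphasized) - frame_len) // hop
idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
frames = emphasized[idx] * np.hamming(frame_len)

# Magnitude spectrogram: the input to MFCC filter banks or a CNN layer
spec = np.abs(np.fft.rfft(frames, n=512, axis=1))
print(spec.shape)                                        # (98, 257)
```

From `spec`, MFCCs would follow by applying a Mel filter bank, a log, and a DCT; the patent additionally feeds the spectrogram to a CNN layer.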
Further, the extracting the feature vector of the electroencephalogram signal specifically includes:
The electroencephalogram signals are first preprocessed and denoised; fractal-dimension features and multi-scale entropy features are then extracted respectively to construct the feature vectors of the electroencephalogram signals.
Further, the preprocessing denoising comprises:
The electroencephalogram signal of the subject is collected at a fixed sampling frequency, a db5 wavelet is selected for multi-layer wavelet decomposition, the wavelet-packet coefficients generated by noise are set to zero using a soft-threshold method, and finally the electroencephalogram signal is reconstructed.
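The soft-threshold rule at the core of this denoising step can be sketched as below. The full pipeline would additionally perform the db5 multi-layer wavelet decomposition and reconstruction (e.g. with PyWavelets' `wavedec`/`waverec`), which is omitted here to keep the sketch dependency-free:

```python
import numpy as np

def soft_threshold(coeffs, thr):
    """Soft-threshold shrinkage: coefficients with |c| <= thr (treated as
    noise) are set to zero, larger ones are shrunk toward zero by thr."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

c = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])   # toy wavelet coefficients
denoised = soft_threshold(c, 1.0)
print(denoised)                              # [-2. -0.  0.  0.5  3.]
```

In the db5 pipeline, this function would be applied to each detail-coefficient array before reconstruction, with `thr` chosen from a noise estimate (e.g. the universal threshold).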
Further, the extraction process of the fractal-dimension features comprises the following steps:
The original sequence is uniformly sub-sampled to obtain K sequences; the change of each element within the K sequences is calculated to construct new sequences; the new sequences are fitted to obtain a slope, and the opposite number of the slope is taken as the initial fractal-dimension (FD) feature. The preprocessed and denoised electroencephalogram signal is then divided into several segments by a sliding window, and the fractal-dimension features are extracted from each segment.
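This procedure matches Higuchi's fractal-dimension estimate; the sketch below is written under that assumption (the maximum scale `kmax` and the exact length normalization are conventional choices the patent does not fix):

```python
import numpy as np

def higuchi_fd(x, kmax=8):
    """Fractal dimension via the procedure above (Higuchi's method):
    uniformly sub-sample into k interleaved sequences, average the
    normalized absolute element-to-element changes, and negate the
    slope of the log-log fit."""
    x = np.asarray(x, float)
    n = len(x)
    lk = np.empty(kmax)
    for k in range(1, kmax + 1):
        lm = []
        for m in range(k):                        # k sub-sampled sequences
            idx = np.arange(m, n, k)
            num = np.abs(np.diff(x[idx])).sum()   # element-to-element change
            norm = (n - 1) / ((len(idx) - 1) * k)
            lm.append(num * norm / k)
        lk[k - 1] = np.mean(lm)
    # slope of log L(k) vs log k; the FD is its opposite number
    slope = np.polyfit(np.log(np.arange(1, kmax + 1)), np.log(lk), 1)[0]
    return -slope

print(higuchi_fd(np.arange(1000.0)))   # a straight line has FD of about 1
```

In the windowed variant described above, `higuchi_fd` would simply be applied to each window of the denoised EEG segment.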
the extraction process of the multi-scale entropy features comprises the following steps:
The multi-scale entropy of the electroencephalogram signal is calculated, the average multi-scale entropy values of the subject's happy and sad emotions are obtained, and the multi-scale entropy at the first one or several scales is then selected as the feature vector of the electroencephalogram signal.
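A sketch of multi-scale entropy as coarse-graining plus sample entropy. The template length m = 2 and tolerance r = 0.2·std are conventional choices assumed here, and the EEG epoch is a random stand-in:

```python
import numpy as np

def sample_entropy(x, m=2, r=0.2):
    """Sample entropy with tolerance r * std(x), Chebyshev distance."""
    x = np.asarray(x, float)
    tol = r * x.std()

    def count_matches(mm):
        # N - m templates of length mm; pairwise Chebyshev distances
        t = np.array([x[i:i + mm] for i in range(len(x) - m)])
        d = np.max(np.abs(t[:, None, :] - t[None, :, :]), axis=2)
        return ((d <= tol).sum() - len(t)) / 2   # exclude self-matches

    return -np.log(count_matches(m + 1) / count_matches(m))

def multiscale_entropy(x, scales=(1, 2)):
    """Coarse-grain the signal at each scale, then take sample entropy."""
    out = []
    for s in scales:
        n = len(x) // s
        coarse = np.asarray(x[:n * s], float).reshape(n, s).mean(axis=1)
        out.append(sample_entropy(coarse))
    return out

rng = np.random.default_rng(1)
sig = rng.normal(size=400)               # stand-in for a denoised EEG epoch
mse = multiscale_entropy(sig, scales=(1, 2))
print(mse)
```

Selecting "the first one or several scales" then amounts to taking the leading entries of `mse` as the EEG feature vector.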
Further, before the classification and recognition, the method further comprises: performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the speech and the feature vectors of the electroencephalogram signals respectively, obtaining the fusion features, and then performing classification and recognition. The classification model adopts an SVM classification model whose kernel function is an RBF kernel. The emotion tags include happy, sad, calm and disgusted.
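The normalization-plus-RBF-SVM stage can be sketched with scikit-learn on synthetic stand-in features; the feature dimension, cluster layout and all data are placeholders, not values from the patent:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical fused feature vectors with the four discrete labels:
# 0 = happy, 1 = sad, 2 = calm, 3 = disgusted.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=2.0 * c, size=(30, 8)) for c in range(4)])
y = np.repeat(np.arange(4), 30)

# Per-feature normalization followed by an RBF-kernel SVM classifier,
# mirroring the classification stage described above.
clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
clf.fit(X, y)
acc = clf.score(X, y)
print(acc)
```

Bundling the scaler and the SVM in one pipeline ensures the same normalization statistics are reused at prediction time.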
Further, the modal attention is calculated through a self-attention mechanism so as to construct the weight matrix and obtain the fusion weights. The calculation expression of the modal attention is:

A = (Θ · Φ^T) / √d

where A is the modal attention, (·) denotes matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
The query matrix Θ is constructed as follows: the feature vectors of each modality are connected through a first fully connected layer, whose connection formula is y1 = w1·x + b1; the activation-function output then gives a feature matrix that forms the query matrix Θ. The query matrix Θ represents the influence of the current modality on the other modalities;
The key matrix Φ is constructed as follows: the feature vectors of each modality are connected through a second fully connected layer, whose connection formula is y2 = w2·x + b2; the activation-function output then gives a feature matrix that forms the key matrix Φ. The key matrix Φ represents the influence of the other modalities on the current modality;
The elements of each column of the modal attention A are summed to obtain the weight of modality i, with the corresponding calculation formula:

Ψ_i = Σ_k a_ki

where Ψ_i is the weight of modality i and a_ki is the element in the k-th row and i-th column of the modal attention A;
The weights of all modalities sum to 1, i.e.:

Σ_i Ψ_i = 1.
the modal attention A is trained along with the training process of the classification model to adjustParameter w in first and second fully-connected layers1、b1、w2And b2。
The invention also provides a multi-modal emotion recognition system for a companion robot, comprising:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the emotion analysis module based on facial expressions, used for extracting the emotional feature vectors of the facial expressions according to the facial expression pictures;

the emotion analysis module based on speech signals, used for extracting the emotional feature vectors of the speech according to the voice signals;

the emotion analysis module based on electroencephalogram signals, used for extracting the feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
and the emotion prediction module, used for performing emotion prediction according to the fusion features, wherein the emotion prediction performs data-fitting training on the fusion features to obtain continuous emotion intensity values over a plurality of emotion dimensions, the emotion dimensions comprising pleasure, tension, excitement and certainty.
Compared with the prior art, the invention has the following advantages:
different from the traditional single-modal emotion recognition system, the emotion recognition method and the emotion recognition system fully combine facial expressions, voice signals and electroencephalogram signals to carry out emotion analysis and discrimination, and can enhance the enlarging capability of emotion characteristics and perfect the mapping capability of emotion characterization space through multi-modal information fusion, so that the robot can show the emotion recognition capability closer to that of human beings.
Meanwhile, the multi-modal fusion mode based on the self-attention mechanism is adaptive and flexible. By combining the emotion expression advantages of different heterogeneous modalities and dynamically updating the weight parameters, the robot gains the ability to autonomously evolve and continuously adjust its emotion judgment, providing a new paradigm for human-machine emotion interaction.
In addition, the joint use of discrete emotion classification and continuous emotion-dimension prediction effectively describes the multi-modal emotional state and characterizes emotion changes more scientifically and deeply, a clear advantage for the robot in broadly understanding human emotion. The system reduces the loss of information when processing complex nonlinear multi-modal data and performs well on data with large modality spans.
Drawings
Fig. 1 is a schematic block diagram of a multi-modal emotion recognition system for a companion robot provided in an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an emotion analysis module based on facial expressions provided in an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an emotion analysis module based on speech signals provided in an embodiment of the present invention;
FIG. 4 is a schematic block diagram of an emotion analysis module based on electroencephalogram signals provided in an embodiment of the present invention;
FIG. 5 is a schematic block diagram of a feature fusion module based on a self-attention mechanism provided in an embodiment of the present invention;
FIG. 6 is a schematic block diagram of an identification module based on discrete emotion classification according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a prediction module based on continuous emotion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present embodiment provides a multimodal emotion recognition system for a companion robot, including:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the emotion analysis module is used for extracting emotion feature vectors of the facial expressions according to the facial expression pictures;
the emotion analysis module is used for extracting emotion feature vectors of the voice according to the voice signals;
the emotion analysis module is used for extracting feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
the prediction module based on continuous emotion is used for carrying out emotion prediction according to the fusion features, the emotion prediction is used for carrying out data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty;
and the display module is used for displaying the results of the discrete emotion classification and the continuous emotion prediction in real time.
In summary, the multi-modal information is derived from the facial expressions, voice signals and electroencephalogram signals acquired by the accompanying robot through the acquisition module. Preprocessing, data feature extraction, modal output and other operations on the multi-modal information are realized through the recognition units for the different modalities in the recognition module. By introducing a self-attention mechanism in the fusion module, attention weight coefficients are calculated for the different modalities to realize feature-level modal fusion; after the emotion classification and prediction modules, the display module shows the multi-modal emotion recognition result.
Each module is described in detail below.
1. Multi-modal acquisition module
The multi-modal acquisition module comprises an emotion acquisition device for facial expressions, an emotion acquisition device for voice signals, an emotion acquisition device for electroencephalogram signals, an emotion acquisition device for physiological signals, and the like; through these multiple sensors, multi-modal data collection of the accompanied subject targeted by the accompanying robot is achieved.
2. Emotion analysis module based on facial expressions
Fig. 2 is a schematic block diagram of the emotion analysis module based on facial expressions according to the present invention, which includes four units: face detection, image preprocessing, feature extraction and modal output. The specific steps, by unit, are as follows:
(1) Firstly, target detection is realized using Adaboost combined with Haar features. The Adaboost algorithm extracts the Haar-like features of the face, including rectangular features of the input image. A Haar feature reflects the gray-level change of an image: a black rectangle and a white rectangle are combined into a feature template, and the Haar feature value of the template is calculated by subtracting the sum of the pixels in the white rectangle from the sum of the pixels in the black rectangle. Common features include edge features, line features, center-surround features and diagonal features; the outlines of the facial features differ in color from the rest of the face, so Haar features can describe the gray-level changes of the human face. To achieve fast calculation, an integral image approach is used: the integral image can quickly compute the sum of the pixels of any rectangular area in the image, so the Haar features of the image can be calculated quickly.
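The integral-image computation described above can be sketched as follows (a minimal pure-Python illustration, not the patented implementation; the two-rectangle edge feature shown is one assumed example of a Haar template). Each entry of the integral image holds the sum of all pixels above and to the left, so any rectangle sum, and hence any Haar feature value, costs only four lookups.

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, top, left, bottom, right):
    """Pixel sum of the inclusive rectangle, via four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total

def haar_two_rect(ii, top, left, h, w):
    """Edge-type Haar feature: black (left half) minus white (right half)."""
    mid = left + w // 2 - 1
    black = rect_sum(ii, top, left, top + h - 1, mid)
    white = rect_sum(ii, top, mid + 1, top + h - 1, left + w - 1)
    return black - white
```

With the table precomputed, every Haar feature evaluation is constant-time regardless of rectangle size, which is what makes the Adaboost cascade fast.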
(2) The image preprocessing can recover useful information in the image and reduce irrelevant information in the image, wherein the histogram equalization is used for changing the histogram distribution of the image into approximately uniform distribution, so that the contrast of the image is enhanced.
(3) In the feature extraction unit, considering the low-complexity requirement of a system suitable for a companion robot, facial expression features are extracted with the uniform-pattern LBP algorithm. A 3 x 3 texture region is constructed and the 8 surrounding pixel values are compared with a threshold, the threshold being the central pixel value of the texture region: if a neighbor's value is greater than the threshold pixel value, that neighborhood position is set to 1; if it is less, the position is set to 0. Within the 3 x 3 region, the values generated by the 8 adjacent pixel points are assembled clockwise into an 8-bit binary number. The number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number is counted: if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, the LBP value of the center is set to P + 1 (with P = 8 neighbors, this is 9). All pixel points are traversed to obtain the LBP values of the whole image, and all the LBP values are connected in sequence into a feature vector, namely the uniform-pattern LBP feature of the whole image.
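The per-neighborhood computation above can be sketched as follows (an illustrative pure-Python version; the neighbor ordering is assumed to start at the top-left corner and run clockwise, and transitions are counted circularly, as is standard for uniform LBP):

```python
def uniform_lbp_code(patch):
    """LBP code of a 3x3 patch; non-uniform patterns map to P + 1."""
    center = patch[1][1]
    # 8 neighbors, clockwise from the top-left corner
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[y][x] > center else 0 for y, x in coords]
    # circular count of 0->1 / 1->0 transitions
    transitions = sum(bits[i] != bits[(i + 1) % 8] for i in range(8))
    if transitions <= 2:                       # uniform pattern
        return sum(b << (7 - i) for i, b in enumerate(bits))
    return 8 + 1                               # non-uniform: P + 1 = 9
```

A flat patch yields code 0, a bright top edge yields the uniform code 11100000 (224), and a checkerboard-like neighborhood collapses to the single non-uniform bin 9 — which is exactly what keeps the uniform-pattern feature vector compact.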
(4) After the emotional feature vector of the facial expression is obtained, the modal output unit pre-stores and outputs the modality.
3. Emotion analysis module based on voice signal
Fig. 3 is a schematic block diagram of the emotion analysis module based on speech signals, which mainly includes a data preprocessing unit, an MFCC feature extraction unit, a spectrogram feature extraction unit and a modal output unit. The basic steps of these units are as follows:
(1) In order to make the obtained original signal smoother, the signal is windowed: a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis.
(2) In the MFCC feature extraction stage, the speech signal is converted from the Hertz scale to the Mel scale, and cepstral analysis is then carried out. The idea is that during speech the articulatory system (lips and vocal cords) suppresses the high-frequency part of the signal; to compensate for this and to highlight the high-frequency formants, a pre-emphasis high-pass filter is added. By applying a coefficient in the frequency domain that is positively correlated with frequency, the amplitude of the high frequencies is boosted.
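The windowing and pre-emphasis steps can be sketched as follows (a minimal pure-Python illustration; the pre-emphasis coefficient 0.97 is a conventional choice, not a value stated in this document):

```python
import math

def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def hamming_window(n_samples):
    """Hamming window coefficients for one frame."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def frame_and_window(signal, frame_len):
    """Split into non-overlapping frames and apply the Hamming window."""
    win = hamming_window(frame_len)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [[s * w for s, w in zip(f, win)] for f in frames]
```

Each windowed frame would then be Fourier-transformed for the Mel filter bank and cepstral analysis; overlapping frames are also common in practice, but non-overlapping ones keep the sketch short.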
(3) In the spectrogram feature extraction unit, the Fourier-transformed spectrogram is fed into a convolutional neural network to extract spectrogram features; the network comprises an input layer, convolutional layers, pooling layers and a fully-connected layer. Specifically, the CNN structure consists of an input layer, 2 convolutional layers, 2 pooling layers and a fully-connected layer. The input image size is 128 x 128 pixels. The first convolutional layer consists of 64 convolution kernels of 5 x 5 pixels, followed by a ReLU nonlinear activation function and a 2 x 2 pooling layer; the purpose of the pooling layer is to reduce computational complexity and extract the main features. The second convolutional layer consists of 128 convolution kernels of 5 x 5, again followed by a ReLU activation function and a 2 x 2 pooling layer. After the second pooling layer, 128 feature maps of size 32 x 32 are obtained; finally, a fully-connected layer of 512 neurons is attached, yielding a 512-dimensional feature vector.
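The layer-by-layer sizes stated above can be checked with a short shape trace (pure Python; 'same' convolution padding is assumed, since the stated sizes 128 → 64 → 32 halve only at the pooling layers):

```python
def conv_same(size, channels_out):
    """5x5 convolution with 'same' padding keeps the spatial size."""
    return size, channels_out

def pool2(size, channels):
    """2x2 pooling halves the spatial size."""
    return size // 2, channels

size, ch = 128, 1                 # 128 x 128 input spectrogram
size, ch = conv_same(size, 64)    # conv1: 64 kernels of 5 x 5 -> 128 x 128 x 64
size, ch = pool2(size, ch)        # pool1 -> 64 x 64 x 64
size, ch = conv_same(size, 128)   # conv2: 128 kernels of 5 x 5 -> 64 x 64 x 128
size, ch = pool2(size, ch)        # pool2 -> 32 x 32 x 128
flat = size * size * ch           # flattened input to the 512-unit FC layer
```

The trace confirms that the 512-neuron fully-connected layer receives a 32 x 32 x 128 = 131072-dimensional flattened input.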
(4) After the emotional feature vector of the speech is obtained, the modal output unit pre-stores and outputs the modality.
4. Emotion analysis module based on electroencephalogram signals
FIG. 4 is a schematic block diagram of the emotion analysis module based on electroencephalogram signals, which mainly comprises a preprocessing and denoising unit, a fractal dimension feature extraction unit, a multi-scale entropy feature extraction unit and a modal output unit. The specific steps are as follows:
(1) Firstly, the electroencephalogram signal is preprocessed and denoised. The EEG signal of the FP1 channel of the accompanied person is taken from the signal acquisition module at a sampling frequency of 128 Hz; the acquisition time is 63 s, of which the first 3 s of baseline are removed, leaving 60 s of signal, so the total number of sampling points is 7680. The system selects the db5 wavelet to perform 5-layer wavelet decomposition, then uses a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally reconstructs the electroencephalogram signal. After preprocessing, the EEG signal is smoother and suitable for further processing.
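The soft-threshold rule used in the denoising step can be sketched as follows (coefficients with magnitude below the threshold are zeroed as noise; the rest are shrunk toward zero by the threshold):

```python
def soft_threshold(coeffs, threshold):
    """Soft-threshold wavelet coefficients: zero small ones, shrink the rest."""
    out = []
    for c in coeffs:
        if abs(c) <= threshold:
            out.append(0.0)              # treated as noise
        elif c > 0:
            out.append(c - threshold)    # shrink positive coefficients
        else:
            out.append(c + threshold)    # shrink negative coefficients
    return out
```

In the full pipeline this would be applied to the detail coefficients of the 5-level db5 decomposition (e.g. via a wavelet library such as PyWavelets) before signal reconstruction; unlike hard thresholding, the shrinkage avoids discontinuities in the retained coefficients.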
(2) In the fractal dimension (FD) feature extraction part, the length of the electroencephalogram signal is fixed at N. The signal is then uniformly subsampled with different values of k, constructing k new sequences for each scale by taking every k-th sample from different starting offsets; the lengths of all the new sequences are calculated, a slope is obtained by least-squares fitting of the curve lengths against the scale, and the negative of the slope is the required initial FD feature. Then, window processing is applied to the wavelet-threshold-denoised electroencephalogram signal: the data is segmented with a window of 256 points into 30 segments, and a fractal dimension feature is extracted from each segment, yielding a 30-dimensional feature vector.
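The subsample-measure-and-fit procedure described above matches the Higuchi fractal dimension; a compact pure-Python sketch (illustrative only, with the maximum scale kmax as an assumed free parameter) is:

```python
import math

def higuchi_fd(x, kmax=8):
    """Higuchi fractal dimension: slope of log curve length vs. log(1/k)."""
    n = len(x)
    log_inv_k, log_len = [], []
    for k in range(1, kmax + 1):
        lengths = []
        for m in range(k):                       # k subsampled sequences
            n_steps = (n - m - 1) // k
            if n_steps < 1:
                continue
            dist = sum(abs(x[m + i * k] - x[m + (i - 1) * k])
                       for i in range(1, n_steps + 1))
            # normalized curve length for this starting offset
            lengths.append(dist * (n - 1) / (n_steps * k * k))
        log_inv_k.append(math.log(1.0 / k))
        log_len.append(math.log(sum(lengths) / len(lengths)))
    # least-squares slope of log L(k) against log(1/k) is the FD estimate
    mk = sum(log_inv_k) / len(log_inv_k)
    ml = sum(log_len) / len(log_len)
    num = sum((a - mk) * (b - ml) for a, b in zip(log_inv_k, log_len))
    den = sum((a - mk) ** 2 for a in log_inv_k)
    return num / den
```

A smooth, nearly linear signal gives an FD close to 1, while highly irregular signals approach 2 — this spread is what makes the FD useful as an EEG complexity feature.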
(3) In the multi-scale entropy feature extraction unit, multi-scale entropy analyzes the complexity of the time series at different time scales. To calculate the sample entropy at different time scales, the raw signal first needs to be coarse-grained; coarse-graining means segmenting the original signal using non-overlapping windows of length i and averaging within each window. The sample entropy values obtained at different scales differ, and so does the dimensionality of the resulting multi-scale entropy features. The system selects the electroencephalogram signal of the FP1 channel of the accompanied person and extracts the EEG data obtained from all experiments; samples with a self-reported valence score greater than 6 are labeled as happy emotion, and samples with a score less than 4 as sad emotion. The multi-scale entropy of the electroencephalogram signal is computed, the average multi-scale entropy values of the happy and sad emotions in the experiment are obtained, and the multi-scale entropy at the first several scales is selected as the feature vector of the electroencephalogram signal, yielding a 15-dimensional multi-scale entropy feature in total.
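Coarse-graining and sample entropy — the two ingredients of multi-scale entropy — can be sketched as follows (an illustrative pure-Python version; the tolerance r is conventionally taken as a fraction of the signal's standard deviation, an assumption here rather than a value from this document):

```python
import math

def coarse_grain(signal, scale):
    """Average non-overlapping windows of length `scale`."""
    n = len(signal) // scale
    return [sum(signal[i * scale:(i + 1) * scale]) / scale for i in range(n)]

def sample_entropy(signal, m=2, r=0.2):
    """SampEn = -ln(A/B): matches of length m+1 over matches of length m."""
    def count_matches(length):
        templates = [signal[i:i + length]
                     for i in range(len(signal) - length + 1)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                if max(abs(a - b)
                       for a, b in zip(templates[i], templates[j])) <= r:
                    count += 1
        return count
    b = count_matches(m)
    a = count_matches(m + 1)
    return -math.log(a / b) if a > 0 and b > 0 else float("inf")

def multiscale_entropy(signal, max_scale=5, m=2, r=0.2):
    """Sample entropy of the coarse-grained signal at scales 1..max_scale."""
    return [sample_entropy(coarse_grain(signal, s), m, r)
            for s in range(1, max_scale + 1)]
```

A highly regular signal (e.g. a strict alternation) yields a sample entropy near zero, while irregular signals score higher; the vector over the first scales is what the module would use as the EEG feature.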
(4) After the fractal dimension features and the multi-scale entropy features are obtained, the modal output unit pre-stores and outputs the modality.
5. Feature fusion module based on self-attention mechanism
Fig. 5 is a schematic block diagram of the feature fusion module based on a self-attention mechanism provided in the present invention. The main steps of the attention calculation mechanism are information input, calculation of the attention distribution, and calculation of the weighted average of the input information. To fuse the extracted multi-modal features (facial expressions, speech, electroencephalogram signals, etc.) effectively, a weight matrix is first initialized to represent the weight value of each modal feature; in the feature fusion process, the weight value of each modality is multiplied by the corresponding feature vector, and the weighted vectors are then concatenated. Throughout the training of the system model, the weight matrix is trained along with the model, and its values are continuously adjusted according to the training. Compared with manually fixed weight values, this gives a better effect.
The feature fusion module involves interactions between two types of inputs: αii is the self-attention interaction of modality i, and αij is the attention interaction between modalities, reflecting the effect of modality i on modality j. The modal attention is calculated as follows:
A = softmax(Θ·Φ^T / √d)
in the formula, A is the modal attention, (·) is matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
the construction process of the query matrix Θ is as follows: the feature vectors of each modality are connected through a first fully-connected layer, whose connection formula is y1 = w1·x + b1; a feature matrix is finally obtained through the activation function output, forming the query matrix Θ;
the construction process of the key matrix Φ is as follows: the feature vectors of each modality are connected through a second fully-connected layer, whose connection formula is y2 = w2·x + b2; a feature matrix is finally obtained through the activation function output, forming the key matrix Φ;
the weight of modality i is obtained by adding the elements of each column of the modal attention A; the corresponding calculation formula is:
Ψi = ∑k aki
in the formula, Ψi is the weight of modality i, and aki is the element in the k-th row and i-th column of the modal attention A;
the sum of the weights of all the modalities is 1, i.e.:
∑i Ψi = 1.
the modal attention A is trained along with the training process of the SVM classification model, and the parameter w in the first full-connection layer and the parameter w in the second full-connection layer are adjusted1、b1、w2And b2。
6. Identification module based on discrete emotion classification
Fig. 6 is a schematic block diagram of the recognition module based on discrete emotion classification according to the present invention. After the feature vectors of the multiple modalities are calculated, data normalization usually needs to be performed on the feature vectors. When the data are used for SVM classification, normalized data shorten the training time and improve the test accuracy compared with the raw data; normalization makes the data more compact, so that the optimal classification hyperplane can be obtained. The system uses svm-scale to scale the data to [0, 1] or [-1, 1]; the purpose of scaling is to prevent any one feature from being too large or too small, to speed up computation, and to facilitate model training. The system selects the RBF kernel as the kernel function of the SVM classification algorithm; the RBF kernel corresponds to a nonlinear mapping, can handle problems that are not linearly separable, and is suitable for processing multidimensional vectors. After model training, the recognition module outputs one of four emotion labels: happy, sad, calm and disgust.
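The svm-scale step amounts to per-feature min-max scaling; a minimal sketch of scaling each feature column to [0, 1] is:

```python
def fit_minmax(rows):
    """Per-feature minimum and maximum over the training rows."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale_minmax(row, lo, hi):
    """Scale one feature vector to [0, 1]; constant features map to 0."""
    return [(v - l) / (h - l) if h > l else 0.0
            for v, l, h in zip(row, lo, hi)]

# toy fused feature vectors with very different feature ranges
train = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
lo, hi = fit_minmax(train)
scaled = [scale_minmax(r, lo, hi) for r in train]
```

As with svm-scale's saved scaling parameters, the same lo/hi learned on the training set must be reused to scale test data, otherwise the SVM sees inconsistently scaled inputs.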
7. Prediction module based on continuous emotion
Fig. 7 is a schematic block diagram of the prediction module based on continuous emotion. The continuous emotion dimensions are defined as pleasure, tension, excitement and certainty, each quantized to a standard value range of 0 to 10. Data fitting training is performed on the multi-modal emotion features using a multivariate nonlinear regression method, and the module finally outputs the emotion intensity values corresponding to the four dimensions.
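A minimal sketch of the data-fitting idea is given below (a plain linear least-squares fit per emotion dimension via gradient descent, with outputs clipped to the stated 0-10 range; the actual module uses multivariate nonlinear regression, and the training pairs here are hypothetical toy values):

```python
def fit_linear(xs, ys, lr=0.05, epochs=3000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_intensity(x, w, b):
    """Predicted emotion intensity, clipped to the 0-10 scale."""
    return min(10.0, max(0.0, w * x + b))

# toy scalar fused feature vs. 'pleasure' intensity: exactly y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear(xs, ys)
```

One such regressor (in general, over the full fused feature vector and with nonlinear terms) would be fitted per dimension, so the module emits four clipped intensity values.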
The multi-modal emotion recognition system for the accompanying robot shown in this embodiment solves the problems of emotional interaction and recognition for the accompanying robot. By fully collecting multi-modal information such as facial expressions, voice signals and electroencephalogram signals and combining it with the different recognition modules, preprocessing and feature extraction of heterogeneous data are realized, and multi-modal information fusion at the feature level is then realized based on the self-attention mechanism. Unlike traditional emotion recognition systems carried by robots, the invention fully combines discrete emotion classification with continuous-dimension emotion prediction, completely depicts the emotional feature space of the subject, can comprehensively obtain recognized emotional feedback from the system, predicts the subject's emotion change trend, and greatly improves the accuracy of emotion recognition.
Example 2
The embodiment provides a multi-modal emotion recognition method for a companion robot, which corresponds to the processing process of each module in the multi-modal emotion recognition system in embodiment 1, and specifically comprises the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction is used for performing data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty.
The extracting of the emotional feature vector of the facial expression specifically includes:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
The extraction process of the uniform pattern LBP algorithm comprises the following steps:
constructing a texture region with a size of 3 x 3, where the threshold is the central pixel value of the texture region; comparing the 8 surrounding pixel values with the threshold, and if a value is greater than the threshold pixel value, setting the neighborhood position to 1; if the value is less than the threshold pixel value, setting the neighborhood position to 0; in the 3 x 3 texture region, assembling the values generated by the 8 adjacent pixel points clockwise into an 8-bit binary number, and counting the number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number; if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, setting the LBP value of the center to P + 1 (with P = 8, this is 9); traversing all the pixel points to obtain the LBP values of the whole image, and connecting all the LBP values in sequence into a feature vector, namely the emotional feature vector of the facial expression.
The extracting of the emotion feature vector of the voice specifically includes:
firstly, windowing is carried out on the speech signal, a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis. Then, a pre-emphasis high-pass filter is designed to compensate for the suppression of high frequencies by the lips and vocal cords during speech production, and MFCC feature extraction is carried out; and the Fourier-transformed spectrogram is input into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
The extracting of the feature vector of the electroencephalogram signal specifically comprises:
firstly, preprocessing and denoising the electroencephalogram signal, then respectively extracting fractal dimension features and multi-scale entropy features, and constructing the feature vector of the electroencephalogram signal.
The pre-processing denoising comprises:
collecting the electroencephalogram signal of the subject at a fixed sampling frequency, then selecting the db5 wavelet for multi-layer wavelet decomposition, then using a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally completing the reconstruction of the electroencephalogram signal.
The extraction process of the fractal dimension features comprises the following steps:
after subsampling the original sequence to obtain new sequence signals, window processing is performed on the preprocessed and denoised electroencephalogram signal; the data is then divided into several segments using the window, and a fractal dimension feature is extracted from each segment of data.
The extraction process of the multi-scale entropy features comprises the following steps:
calculating the multi-scale entropy of the electroencephalogram signal, obtaining the average multi-scale entropy values of the happy emotion and the sad emotion of the tested subject, and then selecting the multi-scale entropy at the first one or more scales as the feature vector of the electroencephalogram signal.
Before the classification recognition, the method further comprises: respectively performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the voice and the feature vectors of the electroencephalogram signals to obtain the fusion features, and performing classification recognition; the classification model adopts an SVM classification model, and the kernel function of the SVM classification model is an RBF kernel function; the emotion labels include happy, sad, calm and disgust.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (10)
1. A multi-modal emotion recognition method for a companion robot, characterized by comprising the following steps:
respectively collecting facial expression pictures, voice signals and electroencephalogram signals;
extracting emotional feature vectors of facial expressions according to the facial expression pictures, extracting emotional feature vectors of voice according to the voice signals, and extracting feature vectors of electroencephalogram signals according to the electroencephalogram signals;
acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain a fusion feature;
loading the fusion features into a pre-constructed and trained classification model for classification and identification to obtain a plurality of discrete emotion label identification results, wherein the classification model is also used for training the weight matrix in the training process;
and performing emotion prediction according to the fusion features, wherein the emotion prediction is used for performing data fitting training on the fusion features to obtain continuous emotion intensity values, the emotion intensity values are divided into a plurality of emotion dimensions, and the emotion dimensions comprise pleasure, tension, excitement and certainty.
2. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the extracting emotion feature vectors of facial expressions specifically comprises:
the method comprises the steps of firstly extracting Haar features in a facial expression picture by using an Adaboost algorithm, constructing a Haar feature graph, then preprocessing the Haar feature graph through histogram equalization, and then extracting emotional feature vectors of facial expressions by using a uniform pattern LBP algorithm.
3. The multi-modal emotion recognition method for a companion robot as claimed in claim 2, wherein the extraction process of the uniform pattern LBP algorithm comprises:
constructing a texture region with a size of 3 x 3, where the threshold is the central pixel value of the texture region; comparing the 8 surrounding pixel values with the threshold, and if a value is greater than the threshold pixel value, setting the position where that pixel is located to 1; if the value is less than the threshold pixel value, setting the position where that pixel is located to 0; in the 3 x 3 texture region, assembling the values generated by the 8 adjacent pixel points clockwise into an 8-bit binary number, and counting the number of 0-to-1 or 1-to-0 transitions in the 8-bit binary number; if there are at most two transitions, the decimal number corresponding to the binary number is the LBP value of the 3 x 3 neighborhood center; if there are more than two transitions, setting the LBP value of the center to P + 1 (with P = 8, this is 9); traversing all the pixel points to obtain the LBP values of the whole image, and connecting all the LBP values in sequence into a feature vector, namely the emotional feature vector of the facial expression.
4. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the extracting emotion feature vectors of speech specifically comprises:
firstly, windowing is carried out on the speech signal, a Hamming window is used for smoothing, and the time-domain signal is converted to the frequency domain for subsequent spectrum analysis; then, a pre-emphasis high-pass filter is designed to compensate for the suppression of high frequencies by the lips and vocal cords during speech production, and MFCC feature extraction is carried out; and the Fourier-transformed spectrogram is input into a pre-constructed and trained convolutional neural network layer, and spectrogram features are extracted to obtain the emotional feature vector of the speech.
5. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein said extracting feature vectors of electroencephalogram signals specifically comprises:
firstly, preprocessing and denoising the electroencephalogram signal, then respectively extracting fractal dimension features and multi-scale entropy features, and constructing the feature vector of the electroencephalogram signal.
6. The multi-modal emotion recognition method for a companion robot as recited in claim 5, wherein the pre-processing denoising comprises:
collecting the electroencephalogram signal of the subject at a fixed sampling frequency, then selecting the db5 wavelet for multi-layer wavelet decomposition, then using a soft-threshold method to set the wavelet packet coefficients generated by noise to zero, and finally completing the reconstruction of the electroencephalogram signal.
7. The multi-modal emotion recognition method for a companion robot as claimed in claim 6, wherein the extraction process of the fractal dimension features comprises:
uniformly sampling the original sequence to obtain K sequences, calculating the variation between adjacent elements in the K sequences to construct new sequences, fitting the new sequences to obtain a slope, and taking the negative of the slope as the initial FD feature; performing window processing on the preprocessed and denoised electroencephalogram signal, dividing the data into a plurality of segments using the window, and respectively extracting a fractal dimension feature from each segment of data;
the extraction process of the multi-scale entropy features comprises the following steps:
calculating the multi-scale entropy of the electroencephalogram signal, obtaining the average multi-scale entropy values of the happy emotion and the sad emotion of the tested subject, and then selecting the multi-scale entropy at the first one or more scales as the feature vector of the electroencephalogram signal.
8. The multi-modal emotion recognition method for a companion robot as recited in claim 1, further comprising, prior to said classification recognition: respectively performing data normalization on the emotional feature vectors of the facial expressions, the emotional feature vectors of the voice and the feature vectors of the electroencephalogram signals to obtain the fusion features, and performing classification recognition; the classification model adopts an SVM classification model, and the kernel function of the SVM classification model is an RBF kernel function; the emotion labels include happy, sad, calm and disgust.
9. The multi-modal emotion recognition method for a companion robot as claimed in claim 1, wherein the weight matrix is constructed by calculating modal attention through a self-attention mechanism to obtain fusion weights, and the calculation expression of the modal attention is as follows:
A = (Θ·Φ^T)/√d
in the formula, A is the modal attention, (·) is matrix multiplication, Θ is the query matrix, Φ is the key matrix, T is the transpose symbol, and d is the embedding dimension;
the construction process of the query matrix theta is as follows: connecting the feature vectors of each mode through a first full connection layer, wherein the connection formula of the first full connection layer is y1=w1x+b1Finally, a characteristic quantity matrix is obtained through activating function output, and a query matrix theta is formed; the query matrix Θ is used to represent the influence of the current modality on other modalities;
the key matrix phi is constructed by the following steps: connecting the feature vectors of each mode through a second fully-connected layer, wherein the connection formula of the second fully-connected layer is y2=w2x+b2Finally, a characteristic quantity matrix is obtained through activating function output to form a key matrix phi; the key matrix phi is used for representing the influence of other modalities on the current modality;
adding the elements of each column of the modal attention A to obtain the weight of modality i, wherein the corresponding calculation formula is as follows:
Ψi = ∑k aki
in the formula, Ψi is the weight of modality i, and aki is the element in the k-th row and i-th column of the modal attention A;
the sum of the weights of all the modalities is 1, i.e.:
∑i Ψi = 1
the modal attention A is trained together with the training process of the classification model to adjust the parameters w1, b1, w2 and b2 in the first fully-connected layer and the second fully-connected layer.
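The attention-based weighting of claim 9 can be sketched in NumPy as follows. The softmax on the attention scores and the final renormalization are assumptions added so that the weights are positive and satisfy the claim's constraint ∑i Ψi = 1; the patent itself only states the query/key construction, the scaled product, and the column sums:

```python
import numpy as np

def modal_attention_weights(feats, w1, b1, w2, b2):
    """feats: (n_modalities, d) matrix of stacked per-modality feature
    vectors. Returns one fusion weight per modality, summing to 1."""
    d = feats.shape[1]
    theta = feats @ w1 + b1              # query matrix (first FC layer)
    phi = feats @ w2 + b2                # key matrix (second FC layer)
    scores = theta @ phi.T / np.sqrt(d)  # scaled attention scores
    # row-wise softmax (assumed) so entries are positive
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = e / e.sum(axis=1, keepdims=True)
    psi = A.sum(axis=0)                  # Ψ_i = Σ_k a_ki (column sums)
    return psi / psi.sum()               # enforce Σ_i Ψ_i = 1
```

In the patented method w1, b1, w2, b2 are learned jointly with the classifier; here they would simply be supplied as arrays.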
10. A system adopting the multi-modal emotion recognition method for a companion robot as set forth in any one of claims 1 to 9, comprising:
the multi-mode acquisition module is used for respectively acquiring facial expression pictures, voice signals and electroencephalogram signals;
the facial expression analysis module is used for extracting emotional feature vectors of the facial expressions according to the facial expression pictures;
the voice emotion analysis module is used for extracting emotional feature vectors of the voice according to the voice signals;
the electroencephalogram analysis module is used for extracting feature vectors of the electroencephalogram signals according to the electroencephalogram signals;
the feature fusion module based on the self-attention mechanism is used for acquiring a weight matrix, and multiplying the emotional feature vector of the facial expression, the emotional feature vector of the voice and the feature vector of the electroencephalogram signal by the weight matrix to obtain fusion features;
the recognition module based on discrete emotion classification is used for loading the fusion features into a pre-constructed and trained classification model for classification recognition to obtain a plurality of discrete emotion label recognition results, and is also used for training the weight matrix in the training process of the classification model;
and the emotion prediction module is used for carrying out emotion prediction according to the fusion characteristics, and the emotion prediction is used for carrying out data fitting training on the fusion characteristics to obtain continuous emotion intensity values which are divided into a plurality of emotion dimensions, wherein the emotion dimensions comprise pleasure, tension, excitement and certainty.
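The module decomposition of claim 10 amounts to a thin orchestration layer over the per-modality extractors, the fusion step, and the two heads (discrete classification and continuous intensity prediction). A minimal sketch, where every field name and callable is a hypothetical stand-in rather than anything named in the patent:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EmotionRecognitionSystem:
    # each field stands in for one module of claim 10
    extract_face: Callable       # facial expression picture -> feature vector
    extract_voice: Callable      # voice signal -> feature vector
    extract_eeg: Callable        # EEG signal -> feature vector
    fuse: Callable               # (face, voice, eeg) -> fusion features
    classify: Callable           # fusion features -> discrete emotion label
    predict_intensity: Callable  # fusion features -> continuous dimensions

    def recognize(self, picture, voice, eeg):
        feats = self.fuse(self.extract_face(picture),
                          self.extract_voice(voice),
                          self.extract_eeg(eeg))
        return self.classify(feats), self.predict_intensity(feats)
```

The classifier and the intensity regressor share the same fusion features, mirroring how claim 10 routes both the discrete recognition module and the emotion prediction module off the attention-weighted fusion output.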
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111079583.8A CN113947127A (en) | 2021-09-15 | 2021-09-15 | Multi-mode emotion recognition method and system for accompanying robot |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113947127A true CN113947127A (en) | 2022-01-18 |
Family
ID=79328488
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111079583.8A Pending CN113947127A (en) | 2021-09-15 | 2021-09-15 | Multi-mode emotion recognition method and system for accompanying robot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113947127A (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114209323B (en) * | 2022-01-21 | 2024-05-10 | 中国科学院计算技术研究所 | Method for identifying emotion and emotion identification model based on electroencephalogram data |
CN114209323A (en) * | 2022-01-21 | 2022-03-22 | 中国科学院计算技术研究所 | Method for recognizing emotion and emotion recognition model based on electroencephalogram data |
CN114565964A (en) * | 2022-03-03 | 2022-05-31 | 网易(杭州)网络有限公司 | Emotion recognition model generation method, recognition method, device, medium and equipment |
CN115035438A (en) * | 2022-05-27 | 2022-09-09 | 中国科学院半导体研究所 | Emotion analysis method and device and electronic equipment |
CN115064246A (en) * | 2022-08-18 | 2022-09-16 | 山东第一医科大学附属省立医院(山东省立医院) | Depression evaluation system and equipment based on multi-mode information fusion |
CN115064246B (en) * | 2022-08-18 | 2022-12-20 | 山东第一医科大学附属省立医院(山东省立医院) | Depression evaluation system and equipment based on multi-mode information fusion |
CN115410561A (en) * | 2022-11-02 | 2022-11-29 | 中汽数据有限公司 | Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction |
CN115410561B (en) * | 2022-11-02 | 2023-02-17 | 中汽数据有限公司 | Voice recognition method, device, medium and equipment based on vehicle-mounted multimode interaction |
CN116543445A (en) * | 2023-06-29 | 2023-08-04 | 新励成教育科技股份有限公司 | Method, system, equipment and storage medium for analyzing facial expression of speaker |
CN116543445B (en) * | 2023-06-29 | 2023-09-26 | 新励成教育科技股份有限公司 | Method, system, equipment and storage medium for analyzing facial expression of speaker |
CN116561533A (en) * | 2023-07-05 | 2023-08-08 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
CN116561533B (en) * | 2023-07-05 | 2023-09-29 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
CN117056863A (en) * | 2023-10-10 | 2023-11-14 | 湖南承希科技有限公司 | Big data processing method based on multi-mode data fusion |
CN117056863B (en) * | 2023-10-10 | 2023-12-26 | 湖南承希科技有限公司 | Big data processing method based on multi-mode data fusion |
CN117494013A (en) * | 2023-12-29 | 2024-02-02 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
CN117494013B (en) * | 2023-12-29 | 2024-04-16 | 南方医科大学南方医院 | Multi-scale weight sharing convolutional neural network and electroencephalogram emotion recognition method thereof |
CN117520826A (en) * | 2024-01-03 | 2024-02-06 | 武汉纺织大学 | Multi-mode emotion recognition method and system based on wearable equipment |
CN117520826B (en) * | 2024-01-03 | 2024-04-05 | 武汉纺织大学 | Multi-mode emotion recognition method and system based on wearable equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113947127A (en) | Multi-mode emotion recognition method and system for accompanying robot | |
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108899050B (en) | Voice signal analysis subsystem based on multi-modal emotion recognition system | |
CN108682431B (en) | Voice emotion recognition method in PAD three-dimensional emotion space | |
CN114176607B (en) | Electroencephalogram signal classification method based on vision transducer | |
CN104809450B (en) | Wrist vena identification system based on online extreme learning machine | |
Jayanthi et al. | An integrated framework for emotion recognition using speech and static images with deep classifier fusion approach | |
Mini et al. | EEG based direct speech BCI system using a fusion of SMRT and MFCC/LPCC features with ANN classifier | |
Kumar et al. | Artificial Emotional Intelligence: Conventional and deep learning approach | |
He et al. | What catches the eye? Visualizing and understanding deep saliency models | |
CN112418166A (en) | Emotion distribution learning method based on multi-mode information | |
Zhang et al. | DeepVANet: a deep end-to-end network for multi-modal emotion recognition | |
CN113951883B (en) | Gender difference detection method based on electroencephalogram signal emotion recognition | |
CN117058597B (en) | Dimension emotion recognition method, system, equipment and medium based on audio and video | |
Priatama et al. | Hand gesture recognition using discrete wavelet transform and convolutional neural network | |
Morade et al. | Comparison of classifiers for lip reading with CUAVE and TULIPS database | |
Chinmayi et al. | Emotion Classification Using Deep Learning | |
HR et al. | A novel hybrid biometric software application for facial recognition considering uncontrollable environmental conditions | |
Kächele et al. | Fusion mappings for multimodal affect recognition | |
Dixit et al. | Multi-feature based automatic facial expression recognition using deep convolutional neural network | |
CN116343287A (en) | Facial expression recognition and model training method, device, equipment and storage medium | |
Moran | Classifying emotion using convolutional neural networks | |
Sushma et al. | Emotion analysis using signal and image processing approach by implementing deep neural network | |
Kim et al. | A study on user recognition using 2D ECG image based on ensemble networks for intelligent vehicles | |
Gaus et al. | Automatic affective dimension recognition from naturalistic facial expressions based on wavelet filtering and PLS regression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||