CN106096642B - Multi-modal emotional feature fusion method based on discriminative locality preserving projection - Google Patents
Multi-modal emotional feature fusion method based on discriminative locality preserving projection
- Publication number
- CN106096642B CN106096642B CN201610397708.4A CN201610397708A CN106096642B CN 106096642 B CN106096642 B CN 106096642B CN 201610397708 A CN201610397708 A CN 201610397708A CN 106096642 B CN106096642 B CN 106096642B
- Authority
- CN
- China
- Prior art keywords
- emotion
- matrix
- mode
- equal
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention discloses a multi-modal emotional feature fusion method based on discriminative locality preserving projection. The method first extracts emotional features, such as voice features, expression features and posture features, from the sample data of each modality in a multi-modal emotion database; it then maps the emotional features of the various modalities into a uniform discriminative subspace by the discriminative locality preserving projection method; finally, it concatenates the mapped feature groups in series to obtain the fused multi-modal emotional features. A classifier that takes the fused multi-modal emotional features as input can effectively recognize basic emotions such as anger, disgust, fear, happiness, sadness and surprise, and the method provides a new approach for developing human emotion classification and recognition systems and realizing human-computer interaction.
Description
Technical Field
The invention belongs to the field of image processing and pattern recognition, relates to a feature fusion method applied to multi-modal emotion recognition, and particularly relates to a multi-modal emotional feature fusion method based on discriminative locality preserving projection.
Background
Emotional expression has always been the most prominent way for humans to communicate and understand each other. As computer technology has developed, Human-Computer Interaction (HCI) has become increasingly valuable for research and practical purposes, and how computers recognize human emotions has become important. With the continuous development of information technology, the emotional information expressed by human beings, whether in a laboratory or in real life, is easily captured by various sensors. Among these, images and speech are the most readily available affective signals and are also the most important information for emotion recognition.
Which emotions a computer can recognize is a complicated problem: the emotions people express in real life often differ only slightly, in ways that even humans find difficult to distinguish, so computers can currently recognize only basic emotions such as anger, disgust, fear, happiness, sadness and surprise. Nevertheless, technologies for recognizing these basic emotions are already widely used, for example in education, medical treatment, human-computer interaction and video entertainment.
Over the past decades there has been much emotion recognition based on a single modality, most commonly facial expression recognition, speech emotion recognition and gesture-based emotion recognition. Single-modality emotion recognition, however, has a major limitation, because the emotion a person expresses is inherently multi-modal: when a person expresses anger, for example, his voice, facial expression, body posture, heart rate and body temperature all differ greatly from their normal state. Using the emotional features of only one modality therefore does not give good results, especially in a real environment. Research results show that, compared with single-modality emotion recognition, multi-modal emotion recognition is more reliable and accurate. Multi-modal emotion recognition considers the various kinds of emotional information a person expresses, measures the expressed emotion comprehensively, and is robust to the interference encountered in real life (for example, facial images may suffer from varying illumination, viewing angles and other problems).
For multi-modal emotion recognition, feature fusion is the most important link: the different emotional features obtained from different sensors are fused, and the fused features are sent to a classifier for recognition. Common feature fusion methods fall into three main categories: data-layer fusion, feature-layer fusion and decision-layer fusion. To remain real-time, all three must retain enough important information while compressing it, so some information loss is inevitable and the recognition accuracy drops. The feature-layer fusion method is widely applied in the speech and image fields. At present, research on multi-modal emotion recognition is far less complete and rich than that on single-modality emotion recognition.
In the prior art, the invention patent with publication number CN105138991A, titled "A video emotion recognition method based on emotion significant feature fusion", discloses a video emotion recognition method based on fusion of emotionally salient features. Its defects are as follows: only the image features and voice features in a video can be fused, the expandability is poor, and features of further modalities cannot be included; the extracted image and voice features are not direct emotional features but are represented by color emotion intensity values and an audio emotion dictionary; and the fusion algorithm is too simple, so the emotional features obtained by its simple weighting are poorly discriminative.
Disclosure of Invention
The invention aims to solve two technical problems: the fused emotional features produced by existing multi-modal feature fusion methods are poorly discriminative, and existing single-modality emotion recognition technology cannot obtain sufficiently accurate recognition results.
To solve these problems, and aiming at the requirements of automatic human emotion assessment systems and human-computer interaction systems, the invention provides a multi-modal emotional feature fusion method based on discriminative locality preserving projection, offering a more accurate and reliable way for human-computer interaction. The specific technical scheme is as follows:
The multi-modal emotional feature fusion method based on discriminative locality preserving projection comprises the following steps:
A. First, extract emotional features from the sample data of each modality in a multi-modal emotion database, then reduce the dimension of the emotional feature vectors of the various modalities. A sample of the $j$-th modality is represented by a $d_j$-dimensional feature vector $x_{ijr} \in \mathbb{R}^{d_j}$, where $1 \le j \le m$ and $m$ is the number of modalities, $1 \le i \le c$ and $c$ is the number of emotion categories, $1 \le r \le n_{ij}$ and $n_{ij}$ is the number of samples belonging to the $i$-th emotion and $j$-th modality; $x_{ijr}$ is thus the feature vector of the $r$-th sample of the $i$-th emotion and $j$-th modality;
B. Apply discriminative locality preserving projection to the dimension-reduced feature vectors of the different modalities to obtain the optimal projection direction $\alpha$;
C. Map the feature vectors of the different modalities: $Y_j = \alpha^T X_j$, where $X_j$ is the matrix composed of the $c$ blocks $X_{ij}$, i.e. $X_j = [X_{1j}, \ldots, X_{ij}, \ldots, X_{cj}]^T$;
D. Concatenate the mapped features in series to obtain the fused feature:

$$Z = [\alpha^T X_1, \ldots, \alpha^T X_j, \ldots, \alpha^T X_m]^T.$$
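The projection-and-concatenation of steps C and D can be sketched end to end with toy data. In the snippet below, `fuse_features` and the random stand-in for the learned projection $\alpha$ are our own illustrative names, not from the patent:

```python
import numpy as np

def fuse_features(X_list, alpha):
    """Steps C-D: project each modality's feature matrix X_j with the shared
    projection alpha (Y_j = alpha^T X_j), then concatenate the Y_j in series."""
    return np.concatenate([alpha.T @ X for X in X_list], axis=0)

rng = np.random.default_rng(0)
m, d, n = 2, 50, 30                       # modalities, feature dim, samples
X_list = [rng.normal(size=(d, n)) for _ in range(m)]
alpha = rng.normal(size=(d, 10))          # stand-in for the learned projection
Z = fuse_features(X_list, alpha)
print(Z.shape)                            # fused feature matrix, one column per sample
```

With $m$ modalities projected to 10 dimensions each, the fused vectors have $10m$ dimensions, which is why the projection dimension, not the raw feature dimension, controls the classifier's input size.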
Further, in step B, after dimension reduction, discriminative locality preserving projection is carried out to solve for the optimal projection matrix $\alpha$, mapping the emotional feature vectors $x_{ijr}$ of the various modalities into a uniform discriminative subspace to obtain the mapped feature vectors $y_{ijr} = \alpha^T x_{ijr}$. The specific steps are:
B1: Define the intra-class dispersion matrix

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{r=1}^{n_{ij}} \sum_{l=1}^{n_{ik}} (y_{ijr} - y_{ikl})(y_{ijr} - y_{ikl})^T W_{rl},$$

where $y_{ikl}$ is the mapped feature vector of the $l$-th sample from the $i$-th emotion and $k$-th modality, $1 \le k \le m$, and $W_{rl}$ is the local preserving weight between feature vectors from the same emotion and modality.
B2: Define the inter-class dispersion matrix

$$S_b = \sum_{i=1}^{c} \sum_{h=1}^{c} (\mu_i - \mu_h)(\mu_i - \mu_h)^T B_{ih},$$

where $B_{ih}$ is the local preserving weight between feature-vector means from the same modality, and $\mu_i$ is the mean of the mapped feature vectors of the $i$-th class:

$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{m} \sum_{r=1}^{n_{ij}} y_{ijr},$$

where $n_i$ is the number of samples in class $i$; $\mu_h$ is likewise the mapped feature-vector mean of the $h$-th class.
B3: Maximize the inter-class dispersion matrix while minimizing the intra-class dispersion matrix; this goal can be expressed as the optimization problem

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}(S_b)}{\mathrm{Tr}(S_w)},$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix.
Further, in step B1, where the intra-class dispersion matrix $S_w$ is defined, the local preserving weight matrix $W_{rl}$ between feature vectors is defined as follows. For feature vectors $x_{ijr}$ and $x_{ijl}$ from the same emotion and modality, define

$$W_{rl} = \exp\!\left(-\frac{\|x_{ijr} - x_{ijl}\|^2}{t}\right),$$

where $x_{ijl}$ is the feature vector of the $l$-th sample from the $i$-th emotion and $j$-th modality, $1 \le l \le n_{ij}$, and the parameter $t$ may be set empirically. Weights between feature vectors from different emotions or modalities are not considered (they are set to zero).
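As a sketch of this weight definition (a standard LPP-style heat kernel; the function name `local_weights` is ours), the weight matrix for the samples of one emotion/modality group can be computed as:

```python
import numpy as np

def local_weights(X, t=1.0):
    """W_rl = exp(-||x_r - x_l||^2 / t) between all samples of ONE
    emotion/modality group; weights across groups are simply zero."""
    # pairwise squared Euclidean distances via broadcasting
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / t)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))               # 5 toy samples of one group
W = local_weights(X, t=2.0)
```

The matrix is symmetric with ones on the diagonal; smaller $t$ concentrates the weights on the nearest neighbours, which is what "locality preserving" refers to.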
Further, in step B2, where the inter-class dispersion matrix $S_b$ is defined, the local preserving weight matrix $B_{ih}$ between the feature-vector means is defined as follows. In the original sample space (marked by the superscript $(x)$), the feature-vector mean of the $h$-th emotion in the $j$-th modality is

$$\mu_{hj}^{(x)} = \frac{1}{n_{hj}} \sum_{r=1}^{n_{hj}} x_{hjr},$$

where $n_{hj}$ is the number of samples belonging to the $h$-th emotion and $j$-th modality, $x_{hjr}$ is the feature vector of the $r$-th such sample, and $1 \le h \le c$. For feature-vector means $\mu_{ij}^{(x)}$ and $\mu_{hj}^{(x)}$ from the same modality, define

$$B_{ih} = \exp\!\left(-\frac{\|\mu_{ij}^{(x)} - \mu_{hj}^{(x)}\|^2}{t}\right).$$

The parameter $t$ can again be set empirically; weights between feature-vector means from different modalities are not considered.
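A corresponding sketch for the mean-level weights, again assuming the heat-kernel form described above; `mean_weights`, the toy class means, and the diagonal row-sum matrix `E` (used later in the matrix form of $S_b$) are our illustrative names:

```python
import numpy as np

def mean_weights(mu, t=1.0):
    """B_ih = exp(-||mu_i - mu_h||^2 / t) between the c class-mean vectors
    of one modality (means from different modalities get no weight)."""
    sq = ((mu[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / t)

mu = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # toy class means, c = 3
B = mean_weights(mu, t=1.0)
E = np.diag(B.sum(axis=1))                # diagonal matrix of row sums of B
```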
Further, in step B3, the optimization problem of maximizing the inter-class dispersion while minimizing the intra-class dispersion is solved for the optimal projection matrix $\alpha^*$ as follows.
B3.1: Substituting $y = \alpha^T x$ and rewriting in matrix form transforms the optimization problem of B3 into

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}\!\left(\alpha^T \tilde{S}_b \alpha\right)}{\mathrm{Tr}\!\left(\alpha^T \tilde{S}_w \alpha\right)}.$$

The denominator contains the intra-class dispersion matrix in the original space,

$$\tilde{S}_w = \sum_{i=1}^{c} \sum_{j=1}^{m} X_{ij} L X_{ij}^T, \qquad L = mD_{rr} - W_{rl},$$

where $\mu_{ik}^{(x)}$ is the feature-vector mean of the $i$-th emotion in the $k$-th modality, $n_{ik}$ is the number of samples of the $i$-th emotion in the $k$-th modality, $X_{ij}$ is the feature matrix formed by the $n_{ij}$ feature vectors $x_{ijr}$, and $D_{rr}$ is the diagonal matrix whose entries are the row (or column) sums of the weight matrix $W$ between sample feature vectors ($W$ is symmetric, so its row and column sums are equal):

$$D_{rr} = \sum_{l} W_{rl}.$$

The numerator contains the inter-class dispersion matrix in the original space,

$$\tilde{S}_b = \sum_{j=1}^{m} M_j (E - B) M_j^T,$$

where $M_j = [\mu_{1j}^{(x)}, \ldots, \mu_{cj}^{(x)}]$ is the matrix composed of the $c$ mean vectors and $E$ is the diagonal matrix of the row (or column) sums of the mean weight matrix $B_{ih}$:

$$E_{ii} = \sum_{h} B_{ih}.$$

B3.2: Because the trace-ratio problem in B3.1 has no closed-form solution, the ratio of traces is converted into the trace of a ratio, finally giving the optimization problem

$$\alpha^* = \arg\max_{\alpha} \mathrm{Tr}\!\left((\alpha^T \tilde{S}_w \alpha)^{-1} (\alpha^T \tilde{S}_b \alpha)\right),$$

which is solved by generalized eigenvalue decomposition: the columns of the optimal projection matrix $\alpha^*$ are the eigenvectors of $\tilde{S}_b \alpha = \lambda \tilde{S}_w \alpha$ with the largest eigenvalues.
Compared with the prior art, the invention has the advantages that:
(1) Compared with single-modality emotional features, the multi-modal fused emotional features used in the emotion recognition problem give higher accuracy and objectivity, and better robustness in real situations.
(2) The multi-modal emotional feature fusion method based on discriminative locality preserving projection considers not only the inter-class dispersion but also the intra-class dispersion, so it discriminates better between samples of different classes, and the locality preserving projection it introduces adapts well to nonlinear conditions. It finally yields multi-modal fused emotional features better suited to emotion recognition.
The invention introduces a multi-modal emotional feature fusion method based on discriminative locality preserving projection and applies it to multi-modal expression classification and recognition, effectively recognizing the six expressions of anger, disgust, fear, happiness, sadness and surprise, and providing a new method and approach for developing automatic human emotion assessment systems and human-computer interaction systems.
Drawings
FIG. 1 is a flow chart of the multi-modal emotional feature fusion method based on discriminative locality preserving projection of the present invention.
FIG. 2 is a partial image in a bimodal emotion database.
Detailed Description
The embodiments of the present invention will now be described in further detail with reference to the accompanying drawings. As shown in FIG. 1, the implementation of the multi-modal emotional feature fusion method based on discriminative locality preserving projection mainly comprises the following steps:
step 1: capturing still images and speech segments of video in a multimodal database
In the specific implementation process, the eNTERFACE bimodal emotion database is adopted. The database contains 1260 video segments from 42 people, each with an emotion label, expressing the 6 basic emotions: anger, disgust, fear, happiness, sadness and surprise (labels 1-6 respectively), as shown in FIG. 2. The video frame size is 720 × 576 at 25 fps, and the audio in the video is sampled at 48 kHz. Each video is divided into frames, and the frame with the richest expression is taken as the static picture of that video. The speech of each video is separated out as the corresponding speech segment, so that each video clip finally corresponds to one static image and one speech segment. 75% of the images and the corresponding speech are randomly selected as training samples, and the remaining 25% are used as test samples.
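The 75/25 random split described above can be sketched as an index-based split with NumPy (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_clips = 1260                            # labelled clips in the database
idx = rng.permutation(n_clips)            # random order of clip indices
n_train = int(0.75 * n_clips)             # 945 training clips
train_idx, test_idx = idx[:n_train], idx[n_train:]
```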
Step 2: extracting the characteristics of image and voice information, reducing dimensions, and expressing by characteristic vector
First, the static image obtained in the previous step is cropped to the face region at a size of 128 × 128; image preprocessing operations such as alignment, scale normalization and gray-level equalization are then performed, and finally features such as Gabor, SIFT and LBP are extracted from the image (in this embodiment, Gabor features are extracted). For the speech segments, the specialized speech processing toolbox openSMILE is used to extract various features (in this embodiment, the emobase2010 feature set is extracted). Because the extracted feature vectors tend to have too high a dimension, PCA is used to reduce them to a suitable dimension, and each dimension-reduced image or speech feature is represented by a $d_j$-dimensional feature vector, i.e. $x_{ijr} \in \mathbb{R}^{d_j}$, where $1 \le j \le m$, $m$ is the number of modalities, $1 \le i \le c$, $c$ is the number of emotion categories, $1 \le r \le n_{ij}$, and $n_{ij}$ is the number of samples belonging to the $i$-th emotion and $j$-th modality; $x_{ijr}$ is the feature vector of the $r$-th sample of the $i$-th emotion and $j$-th modality, $n_i$ is the number of samples in class $i$, and $n$ is the number of all samples. In this embodiment $c = 6$, $m = 2$ and $n_{ij} = 210$; for other multi-modal databases only these parameters need to be changed, for example $m = 3$ for a trimodal database.
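The per-modality PCA step can be sketched with scikit-learn. The feature dimensions below are illustrative stand-ins (the openSMILE emobase2010 set is commonly reported as 1582-dimensional, and the Gabor dimension depends on the filter bank), and the target dimension 40 is an arbitrary choice for $d_j$:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
gabor = rng.normal(size=(120, 1024))      # stand-in for Gabor image features
speech = rng.normal(size=(120, 1582))     # stand-in for emobase2010 vectors

# One PCA per modality, so each modality can keep its own dimension d_j.
img_lowdim = PCA(n_components=40).fit_transform(gabor)
spk_lowdim = PCA(n_components=40).fit_transform(speech)
```

In practice the PCA models would be fitted on the training samples only and then applied to the test samples.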
Step 3: Solve for the optimal projection matrix $\alpha$ by the discriminative locality preserving projection method, mapping the emotional feature vectors $x_{ijr}$ of the various modalities into a uniform discriminative subspace to obtain the mapped feature vectors $y_{ijr} = \alpha^T x_{ijr}$. The specific steps are as follows.
First, for feature vectors $x_{ijr}$ and $x_{ijl}$ from the same class and modality, define the local preserving weight matrix

$$W_{rl} = \exp\!\left(-\frac{\|x_{ijr} - x_{ijl}\|^2}{t}\right), \qquad (1)$$

where $x_{ijl}$ is the feature vector of the $l$-th sample from the $i$-th emotion and $j$-th modality, $1 \le l \le n_{ij}$, and the parameter $t$ may be set empirically; weights between feature vectors from different classes or modalities are not considered. Then define the intra-class dispersion matrix

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{r=1}^{n_{ij}} \sum_{l=1}^{n_{ik}} (y_{ijr} - y_{ikl})(y_{ijr} - y_{ikl})^T W_{rl}, \qquad (2)$$

where $y_{ikl}$ is the mapped feature vector of the $l$-th sample from the $i$-th emotion and $k$-th modality, $1 \le k \le m$.
In the original sample space (marked by the superscript $(x)$), the feature-vector mean of the $h$-th emotion in the $j$-th modality is

$$\mu_{hj}^{(x)} = \frac{1}{n_{hj}} \sum_{r=1}^{n_{hj}} x_{hjr}, \qquad (3)$$

where $n_{hj}$ is the number of samples belonging to the $h$-th emotion and $j$-th modality, $x_{hjr}$ is the feature vector of the $r$-th such sample, and $1 \le h \le c$. Analogously to the intra-class weights, for feature-vector means $\mu_{ij}^{(x)}$ and $\mu_{hj}^{(x)}$ from the same modality define the local preserving weight matrix

$$B_{ih} = \exp\!\left(-\frac{\|\mu_{ij}^{(x)} - \mu_{hj}^{(x)}\|^2}{t}\right), \qquad (4)$$

where the parameter $t$ can again be set empirically; weights between means from different modalities are not considered. The inter-class dispersion matrix is then

$$S_b = \sum_{i=1}^{c} \sum_{h=1}^{c} (\mu_i - \mu_h)(\mu_i - \mu_h)^T B_{ih}, \qquad (5)$$

where $\mu_i$ is the mean of the mapped feature vectors of the class-$i$ samples:

$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{m} \sum_{r=1}^{n_{ij}} y_{ijr}, \qquad (6)$$

and similarly $\mu_h$ is the mean of the mapped class-$h$ sample features.
Finally, to maximize the inter-class dispersion matrix while minimizing the intra-class dispersion matrix, the following optimization problem is obtained:

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}(S_b)}{\mathrm{Tr}(S_w)}, \qquad (7)$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix. Simplification and transformation give the optimization problem

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}(\alpha^T \tilde{S}_b \alpha)}{\mathrm{Tr}(\alpha^T \tilde{S}_w \alpha)}. \qquad (8)$$

The denominator of (8) contains the intra-class dispersion matrix

$$\tilde{S}_w = \sum_{i=1}^{c} \sum_{j=1}^{m} X_{ij} L X_{ij}^T, \qquad L = mD_{rr} - W_{rl}, \qquad (9)$$

where $\mu_{ik}^{(x)}$ is the feature-vector mean of the $i$-th emotion in the $k$-th modality, $n_{ik}$ is the number of samples of the $i$-th emotion in the $k$-th modality, $X_{ij}$ is the feature matrix formed by the $n_{ij}$ feature vectors $x_{ijr}$, and $D_{rr}$ is the diagonal matrix whose entries are the row (or column) sums of the symmetric weight matrix $W$, $D_{rr} = \sum_l W_{rl}$. The numerator of (8) contains the inter-class dispersion matrix

$$\tilde{S}_b = \sum_{j=1}^{m} M_j (E - B) M_j^T, \qquad (10)$$

where $M_j = [\mu_{1j}^{(x)}, \ldots, \mu_{cj}^{(x)}]$ is the matrix composed of the $c$ mean vectors and $E$ is the diagonal matrix of row (or column) sums of the mean weight matrix $B$, $E_{ii} = \sum_h B_{ih}$.
Since the trace ratio in (8) has no closed-form solution, it is converted into the trace of a ratio:

$$\alpha^* = \arg\max_{\alpha} \mathrm{Tr}\!\left((\alpha^T \tilde{S}_w \alpha)^{-1} (\alpha^T \tilde{S}_b \alpha)\right), \qquad (11)$$

which is solved by the generalized eigenvalue decomposition $\tilde{S}_b \alpha = \lambda \tilde{S}_w \alpha$ to obtain the optimal projection matrix $\alpha^*$.
Step 4: Project the training samples and the test samples to obtain the mapped features, and concatenate the mapped features in series to obtain the fused features.
The image features and speech features are mapped by multiplying with $\alpha$: $Y_j = \alpha^T X_j$, where $X_j$ is the matrix composed of the $c$ blocks $X_{ij}$, i.e. $X_j = [X_{1j}, \ldots, X_{ij}, \ldots, X_{cj}]^T$. The mapped features are then concatenated in series:

$$Z = [\alpha^T X_1, \ldots, \alpha^T X_j, \ldots, \alpha^T X_m]^T.$$
Step 5: Send the fused features of the training samples into a classifier for training, and test with the test samples.
The fused features of the training samples obtained in the previous step are sent to a classifier (libSVM in this embodiment); suitable models and parameters are obtained by training the classifier, and finally the test data are sent to the classifier to obtain the recognition result.
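A sketch of this final training/testing step, using scikit-learn's `SVC` in place of the libSVM package named in the embodiment (random features stand in for the fused features $Z$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
Z_train = rng.normal(size=(90, 20))       # toy fused training features
y_train = rng.integers(1, 7, size=90)     # emotion labels 1-6
Z_test = rng.normal(size=(30, 20))        # toy fused test features

clf = SVC(kernel='linear')                # stand-in for libSVM
clf.fit(Z_train, y_train)
pred = clf.predict(Z_test)
```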
The above embodiments are not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (1)
1. A multi-modal emotional feature fusion method based on discriminative locality preserving projection, characterized by comprising the following steps:
A. First, extract emotional features from the sample data of each modality in a multi-modal emotion database, then reduce the dimension of the emotional feature vectors of the various modalities. A sample of the $j$-th modality is represented by a $d_j$-dimensional feature vector $x_{ijr} \in \mathbb{R}^{d_j}$, where $1 \le j \le m$ and $m$ is the number of modalities, $1 \le i \le c$ and $c$ is the number of emotion categories, $1 \le r \le n_{ij}$ and $n_{ij}$ is the number of samples belonging to the $i$-th emotion and $j$-th modality; $x_{ijr}$ is the feature vector of the $r$-th sample of the $i$-th emotion and $j$-th modality;
B. Apply discriminative locality preserving projection to the dimension-reduced feature vectors of the different modalities, obtaining the optimal projection direction $\alpha$ by maximizing the inter-class dispersion matrix and minimizing the intra-class dispersion matrix. The local preserving weight matrix $W_{rl}$ between feature vectors is defined as follows: for feature vectors $x_{ijr}$ and $x_{ijl}$ from the same emotion and modality,

$$W_{rl} = \exp\!\left(-\frac{\|x_{ijr} - x_{ijl}\|^2}{t}\right),$$

where $x_{ijl}$ is the feature vector of the $l$-th sample from the $i$-th emotion and $j$-th modality, $1 \le l \le n_{ij}$, the parameter $t$ can be set empirically, weights between feature vectors from different emotions or modalities are not considered, and $\beta$ is generally 3-5.
The objective of the discriminative locality preserving projection is to solve for the optimal projection matrix $\alpha^*$ that maps the emotional feature vectors $x_{ijr}$ of the various modalities into a uniform discriminative subspace, giving the mapped feature vectors $y_{ijr} = \alpha^T x_{ijr}$. Specifically, define the intra-class dispersion matrix

$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{m} \sum_{k=1}^{m} \sum_{r=1}^{n_{ij}} \sum_{l=1}^{n_{ik}} (y_{ijr} - y_{ikl})(y_{ijr} - y_{ikl})^T W_{rl},$$

where $y_{ikl}$ is the mapped feature vector of the $l$-th sample from the $i$-th emotion and $k$-th modality, $1 \le k \le m$, and $W_{rl}$ is the local preserving weight between feature vectors from the same emotion and modality; and define the inter-class dispersion matrix

$$S_b = \sum_{i=1}^{c} \sum_{h=1}^{c} (\mu_i - \mu_h)(\mu_i - \mu_h)^T B_{ih},$$

where $B_{ih}$ is the local preserving weight between feature-vector means from the same modality and $\mu_i$ is the mean of the mapped feature vectors of the class-$i$ samples:

$$\mu_i = \frac{1}{n_i} \sum_{j=1}^{m} \sum_{r=1}^{n_{ij}} y_{ijr},$$

where $n_i$ is the number of samples in class $i$ and $\mu_h$ is the mapped feature-vector mean of the $h$-th class.
The local preserving weight matrix $B_{ih}$ between the feature-vector means is defined as follows. In the original sample space (marked by the superscript $(x)$), the feature-vector mean of the $h$-th emotion in the $j$-th modality is

$$\mu_{hj}^{(x)} = \frac{1}{n_{hj}} \sum_{r=1}^{n_{hj}} x_{hjr},$$

where $n_{hj}$ is the number of samples belonging to the $h$-th emotion and $j$-th modality, $x_{hjr}$ is the feature vector of the $r$-th such sample, and $1 \le h \le c$. For feature-vector means $\mu_{ij}^{(x)}$ and $\mu_{hj}^{(x)}$ from the same modality, define the local preserving weight

$$B_{ih} = \exp\!\left(-\frac{\|\mu_{ij}^{(x)} - \mu_{hj}^{(x)}\|^2}{t}\right),$$

where the parameter $t$ can also be set empirically, and weights between feature-vector means from different modalities are not considered.
B3: Maximize the inter-class dispersion matrix while minimizing the intra-class dispersion matrix; this goal can be expressed as the optimization problem

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}(S_b)}{\mathrm{Tr}(S_w)},$$

where $\mathrm{Tr}(\cdot)$ is the trace of a matrix and the projection directions found form the optimal projection matrix $\alpha^*$, with $1 \le j \le m$ and $m$ the number of modalities. The optimization problem is solved for the optimal projection direction as follows:
B3.1: Transform the optimization problem of B3 into

$$\alpha^* = \arg\max_{\alpha} \frac{\mathrm{Tr}(\alpha^T \tilde{S}_b \alpha)}{\mathrm{Tr}(\alpha^T \tilde{S}_w \alpha)}.$$

The denominator contains the intra-class dispersion matrix

$$\tilde{S}_w = \sum_{i=1}^{c} \sum_{j=1}^{m} X_{ij} L X_{ij}^T, \qquad L = mD_{rr} - W_{rl},$$

where $\mu_{ik}^{(x)}$ is the feature-vector mean of the $i$-th emotion in the $k$-th modality, $n_{ik}$ is the number of samples of the $i$-th emotion in the $k$-th modality, $X_{ij}$ is the feature matrix formed by the $n_{ij}$ feature vectors $x_{ijr}$, and $D_{rr}$ is the diagonal matrix whose entries are the row (or column) sums of the weight matrix $W$ between sample feature vectors ($W$ is symmetric, so its row and column sums are equal), $D_{rr} = \sum_l W_{rl}$. The numerator contains the inter-class dispersion matrix

$$\tilde{S}_b = \sum_{j=1}^{m} M_j (E - B) M_j^T,$$

where $M_j = [\mu_{1j}^{(x)}, \ldots, \mu_{cj}^{(x)}]$ is the matrix composed of the $c$ mean vectors and $E$ is the diagonal matrix of row (or column) sums of the mean weight matrix $B_{ih}$, $E_{ii} = \sum_h B_{ih}$.
B3.2: Because the optimization problem in B3.1 has no closed-form solution, the ratio of traces is converted into the trace of a ratio, finally giving

$$\alpha^* = \arg\max_{\alpha} \mathrm{Tr}\!\left((\alpha^T \tilde{S}_w \alpha)^{-1} (\alpha^T \tilde{S}_b \alpha)\right),$$

which is solved by the generalized eigenvalue decomposition $\tilde{S}_b \alpha = \lambda \tilde{S}_w \alpha$ to obtain the optimal projection direction $\alpha^*$;
C. Map the feature vectors of the different modalities: $Y_j = \alpha^T X_j$, where $X_j$ is the matrix composed of the $c$ blocks $X_{ij}$, i.e. $X_j = [X_{1j}, \ldots, X_{ij}, \ldots, X_{cj}]^T$, $X_{ij}$ holding the emotional features of emotion category $i$ in modality $j$, with $1 \le i \le c$, $c$ the number of emotion categories, $1 \le j \le m$, and $m$ the number of modalities;
D. Concatenate the mapped features in series to obtain the fused feature:

$$Z = [\alpha^T X_1, \ldots, \alpha^T X_j, \ldots, \alpha^T X_m]^T;$$
E. Send the fused features of the training samples into a classifier for training, and test with the test samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610397708.4A CN106096642B (en) | 2016-06-07 | 2016-06-07 | Multi-mode emotional feature fusion method based on identification of local preserving projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106096642A CN106096642A (en) | 2016-11-09 |
CN106096642B true CN106096642B (en) | 2020-11-13 |
Family
ID=57227299
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610397708.4A Active CN106096642B (en) | 2016-06-07 | 2016-06-07 | Multi-mode emotional feature fusion method based on identification of local preserving projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106096642B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776740A (en) * | 2016-11-17 | 2017-05-31 | 天津大学 | A kind of social networks Text Clustering Method based on convolutional neural networks |
CN108122006A (en) * | 2017-12-20 | 2018-06-05 | 南通大学 | Embedded method for diagnosing faults is locally kept based on differential weights |
CN109284783B (en) * | 2018-09-27 | 2022-03-18 | 广州慧睿思通信息科技有限公司 | Machine learning-based worship counting method and device, user equipment and medium |
CN109584885A (en) * | 2018-10-29 | 2019-04-05 | 李典 | A kind of audio-video output method based on multimode emotion recognition technology |
CN109872728A (en) * | 2019-02-27 | 2019-06-11 | 南京邮电大学 | Voice and posture bimodal emotion recognition method based on kernel canonical correlation analysis |
CN112289306B (en) * | 2020-11-18 | 2024-03-26 | 上海依图网络科技有限公司 | Juvenile identification method and device based on human body characteristics |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103544963B (en) * | 2013-11-07 | 2016-09-07 | 东南大学 | A kind of speech-emotion recognition method based on core semi-supervised discrimination and analysis |
CN104778689B (en) * | 2015-03-30 | 2018-01-05 | 广西师范大学 | A kind of image hashing method based on average secondary image and locality preserving projections |
CN105138991B (en) * | 2015-08-27 | 2016-08-31 | 山东工商学院 | A kind of video feeling recognition methods merged based on emotion significant characteristics |
- 2016-06-07: Application CN201610397708.4A filed in China; granted as patent CN106096642B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN106096642A (en) | 2016-11-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: Room 201, building 2, phase II, No.1 Kechuang Road, Yaohua street, Qixia District, Nanjing City, Jiangsu Province, 210003 Applicant after: NANJING University OF POSTS AND TELECOMMUNICATIONS Address before: 210003 Gulou District, Jiangsu, Nanjing new model road, No. 66 Applicant before: NANJING University OF POSTS AND TELECOMMUNICATIONS |
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |