CN108596039A

CN108596039A - A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks

Info

Publication number: CN108596039A
Application number: CN201810267991.8A
Authority: CN
Inventors: 卢官明; 郭迪; 闫静杰; 卢峻禾
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2018-03-29
Filing date: 2018-03-29
Publication date: 2018-09-28
Anticipated expiration: 2038-03-29
Also published as: CN108596039B

Abstract

The invention discloses a kind of bimodal emotion recognition method and system based on 3D convolutional neural networks.Structure is used for two kinds of 3D convolutional neural networks of expression emotion recognition and posture emotion recognition, and the training set based on bimodal emotion video library and verification collection optimization network model parameter to this method respectively first；The test set for being then based on bimodal emotion video library respectively tests two kinds of neural networks after optimization, obtains expression emotion recognition confusion matrix and posture emotion recognition confusion matrix；The priori for finally utilizing expression emotion recognition confusion matrix and posture emotion recognition confusion matrix, will merge the recognition result of the both modalities which of the expression video sequence and posture video sequence that newly input, obtains the emotional semantic classification result of bimodal.This method uses 3D convolutional neural networks and bimodal blending algorithm, avoids the subjectivity of artificial design features, overcomes the limitation of single mode emotion recognition, and can effectively improve the accuracy and robustness of emotion recognition.

Description

A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks

Technical field

The invention belongs to machine learning and area of pattern recognition, it is related to a kind of video feeling recognition methods and system, especially It is related to a kind of bimodal emotion recognition method and system based on 3D convolutional neural networks.

Background technology

With the high speed development of science and technology, the mankind constantly enhance the dependence of computer, and interactive capability is ground The attention for the person of studying carefully.One of the important goal of computer science development is how personalizing for realization computer, this has become One hot issue of the area research.A critical issue of required solution is to realize computer in human-computer interaction Emotion recognition ability.

Emotion recognition ability is an importance of computer intelligence, it reflects computer and passes through the information to acquisition Judge the ability of the affective state of operator or interlocutor.By studying emotion recognition technology, machine can be identified and understands people Emotion, people can establish more friendly, harmonious man-machine interaction environment.Emotion recognition technology human-computer interaction, medical treatment, The fields such as safety, education and amusement have broad application prospects.With the deep and computer of emotion recognition technical research The continuous improvement of emotion recognition ability will greatly improve the quality of life of the mankind.

Currently, the research of emotion recognition is greatly for single mode such as facial expression, voice or EEG signals It carries out.Compared to single mode, two or more mode possess more emotion informations.The mankind be also by multi-modal mode come Show emotion information.Therefore, multiple modalities signal is excavated and merged to depth, is the one kind for further increasing emotion recognition performance Effective way.

A kind of Chinese patent application " bimodal video feeling recognition methods of compound space-time characteristic " (number of patent application 201611096937.9, publication number CN106529504A), by extract respectively upper body posture sample and human face expression sample when Three value pattern square (TSLTPM) histogram features of empty part and three-dimensional gradient direction (3DHOG) histogram feature, constitute corresponding sample This compound space-time characteristic of upper body posture and the compound space-time characteristic of human face expression, finally uses D-S evidence theory decision rule pair Compound space-time characteristic test collection is classified, and emotion recognition result is obtained.This method uses the feature of engineer, feature extraction Process is relatively complicated, and complexity is higher, in addition, when being merged using D-S evidence theory decision rule, will produce Yin Ji The minor change of this probability distribution function and the unstability for causing fusion results completely different, and in processing conflict completely or The result for running counter to convention is generated when height conflicting evidence.

Chinese patent application " mankind's nature emotion identification method combined based on expression and behavior bimodal " (patent Application number 201610654684.6, publication number CN106295568A), using the emotion cognition framework of two-stage classification pattern, first The trunk motion feature of extraction match comparing with the trunk motion feature library established in advance, it is thick to obtain emotion Classification results；Then, match from the human face expression feature that the human face expression feature database established in advance is found out and is extracted Human face expression feature exports corresponding emotion disaggregated classification result.Greatest problem existing for this method is can not to extract effective people Body torso exercise feature, and it is difficult to set up effective trunk motion feature library and human face expression feature database.

Invention content

Goal of the invention：In view of the deficiencies of the prior art, a kind of based on 3D convolutional neural networks present invention aims at providing Bimodal emotion recognition method and system the extraction of feature is simplified by powerful feature learning and classification capacity, and improve The accuracy and robustness of emotion recognition.

Technical solution：The present invention uses following technical scheme for achieving the above object：

A kind of bimodal emotion recognition method based on 3D convolutional neural networks, includes the following steps：

(1) while everyone facial expression video clip and body posture video clip sample is obtained, by each Video clip is trimmed into an isometric frame sequence, establishes the expression and posture bimodal emotion video for including emotional category label Library, and the sample of bimodal emotion video library is divided into training set, verification collection and test set；

(2) utilize the expression video sequence and posture video sequence that training set and verification are concentrated respectively to the first 3D of structure Convolutional neural networks and the 2nd 3D convolutional neural networks are trained, and optimize network model parameter；The training set is used for network Training after often training iteration preset times, is once tested, whether the selection for verifying network parameter closes on verification collection Reason；The first 3D convolutional neural networks and the 2nd 3D convolutional neural networks include：

Data input layer is used for input video sequence, every frame image in video sequence is normalized；

The composite module of at least two convolutional layer and pond layer, wherein convolutional layer are using several 3D convolution kernels to last layer Output carries out convolution algorithm, and pond layer is used for the output to convolutional layer and carries out down-sampling operation；

Full articulamentum, the output neuron for the output of last layer pond layer to be fully connected to this layer export a spy Sign vector；

And classification layer, the feature vector for exporting full articulamentum are connected to the output section for indicating emotional category entirely Point, exports a n-dimensional vector, and wherein n is emotional category number；

Preferably, the first 3D convolutional neural networks, including the 1 data input layer, at least two volume that are linked in sequence The composite module of lamination and pond layer, 1 full articulamentum and 1 Softmax classification layer；

The data input layer is first layer, is inputted as expression video sequence, to every frame image progress in video sequence Normalized；The length of the expression video sequence is 16,24 or 32 frames；

The composite module of the convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU nonlinear activation function layers select m₁A d₁×k₁×k₁3D convolution kernels convolution algorithm is carried out to the output of last layer, In, d₁、k₁It is chosen in 3,5,7 numerical value, m₁It is chosen in 32,64,128,256,512 numerical value；Pond layer choosing d₂×k₂×k₂ Pondization verification last layer convolutional layer output carry out down-sampling operation, wherein d₂、k₂It is chosen in 1,2,3 numerical value；

The output of last layer pond layer is fully connected to the c output neuron of this layer by the full articulamentum, exports a c The feature vector of dimension, wherein c chooses in 256,512,1024 numerical value；

The feature vector of the full articulamentum output of last layer is connected to n output node by the Softmax classification layers entirely, is passed through It crosses after Softmax is returned and obtains a n-dimensional vector [p₁ p₂ p₃ … p_n]^T, the wherein numerical value of each dimension is exactly to input to regard The emotional category of frequency sequence belongs to the probability of corresponding classification；N is emotional category number.

Preferably, the 2nd 3D convolutional neural networks, including the 1 data input layer, at least two volume that are linked in sequence The composite module of lamination and pond layer, 1 full articulamentum and 1 Softmax classification layer；

The data input layer is first layer, is inputted as posture video sequence, to every frame image progress in video sequence Normalized；The length of the posture video sequence is 16,24 or 32 frames；

The composite module of the convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU nonlinear activation function layers select m₂A d₃×k₃×k₃3D convolution kernels convolution algorithm is carried out to the output of last layer, In, d₃、k₃It is chosen in 3,5,7 numerical value, m₂It is chosen in 32,64,128,256,512 numerical value；Pond layer choosing d₄×k₄×k₄ Pondization verification last layer convolutional layer output carry out down-sampling operation, wherein d₄、k₄It is chosen in 1,2,3 numerical value；

The feature vector of the full articulamentum output of last layer is connected to n output node by the Softmax classification layers entirely, is passed through It crosses after Softmax is returned and obtains a n-dimensional vector [q₁ q₂ q₃ … q_n]^T, the wherein numerical value of each dimension is exactly to input to regard The emotional category of frequency sequence belongs to the probability of corresponding classification.

(3) emotion is carried out to the expression video sequence samples in test set using the first 3D convolutional neural networks after optimization Classification and Identification obtains a n-dimensional vector, the maximum dimension institute of the numerical values recited of each more vectorial dimension, wherein numerical value Corresponding classification is exactly the emotional category of the sample；Retest is carried out to all expression video sequence samples in test set, Statistical classification recognition result obtains expression emotional semantic classification identity confusion matrix E, i.e.,

Similarly, using the 2nd 3D convolutional neural networks after optimization to the posture video sequence sample in test set into market Feel Classification and Identification, obtains a n-dimensional vector, the maximum dimension of the numerical values recited of each more vectorial dimension, wherein numerical value Corresponding classification is exactly the emotional category of the sample；All posture video sequence samples in test set are carried out to repeat survey Examination, statistical classification recognition result obtain posture emotional semantic classification identity confusion matrix G, i.e.,

(4) the first 3D convolutional neural networks and the 2nd 3D convolutional neural networks table to newly inputting respectively after optimization are utilized Feelings video sequence and posture video sequence carry out emotional semantic classification identification, obtain the emotional semantic classification identification of expression and posture both modalities which As a result；

(5) the expression emotional semantic classification identity confusion matrix E and posture emotional semantic classification identity confusion square that step (3) obtains are utilized The priori of battle array G, fusion is weighted by the emotional semantic classification recognition result for the both modalities which that step (4) obtains in decision-making level, Obtain the emotional semantic classification of bimodal as a result, specific steps are as follows：

(5.1) numerical value of the element on expression emotional semantic classification identity confusion matrix E leading diagonals is normalized, is obtained

(5.2) numerical value of the element on posture emotional semantic classification identity confusion matrix G leading diagonals is normalized, is obtained

(5.3) the emotional semantic classification recognition result of expression and posture both modalities which is weighted fusion, obtains a new n Dimensional vector V, i.e.,

Compare the numerical values recited of each dimension in vectorial V, wherein the classification corresponding to the maximum dimension of numerical value is exactly defeated Enter the emotional category of video sequence.

A kind of bimodal emotion recognition system based on 3D convolutional neural networks that another aspect of the present invention provides, including：

Preprocessing module, for obtaining everyone facial expression video clip and body posture video clip sample simultaneously This, an isometric frame sequence is trimmed by each video clip, establishes the expression comprising emotional category label and posture is double Mode emotion video library, and the sample of bimodal emotion video library is divided into training set, verification collection and test set；

Network model training module, the expression video sequence and posture video sequence concentrated using training set and verification are distinguished First 3D convolutional neural networks of structure and the 2nd 3D convolutional neural networks are trained, network model parameter is optimized；It is described First 3D convolutional neural networks and the 2nd 3D convolutional neural networks include：Data input layer is used for input video sequence, to regarding Image in frequency sequence is normalized；The composite module of at least two convolutional layer and pond layer, if wherein convolutional layer uses Dry 3D convolution kernels carry out convolution algorithm to the output of last layer, and pond layer is used for the output to convolutional layer and carries out down-sampling operation； Full articulamentum, the output neuron for the output of last layer pond layer to be fully connected to this layer export a feature vector；With And classification layer, the feature vector for exporting full articulamentum are connected to the output node for indicating emotional category entirely, export one N-dimensional vector, wherein n are emotional category number；

Confusion matrix acquisition module, the first 3D convolutional neural networks for being utilized respectively after optimizing and the 2nd 3D convolution god Through network in test set expression video sequence samples and posture video sequence sample carry out emotional semantic classification identification, and statistical Class recognition result obtains the expression emotional semantic classification identity confusion matrix and posture emotional semantic classification identity confusion matrix of n × n；

Expression and posture emotional semantic classification identification module utilize the first 3D convolutional neural networks and the 2nd 3D convolution after optimization Neural network carries out emotional semantic classification identification to the expression video sequence and posture video sequence that newly input respectively, obtains expression and appearance The emotional semantic classification recognition result of state both modalities which；

And decision-making module, the expression emotional semantic classification identity confusion matrix for being obtained using confusion matrix acquisition module With the priori of posture emotional semantic classification identity confusion matrix, two kinds of moulds that expression and posture emotional semantic classification identification module are obtained The emotional semantic classification recognition result of state is weighted fusion in decision-making level, obtains the emotional semantic classification result of bimodal.

Advantageous effect：Compared with prior art, the present invention has the following technical effects：

(1) present invention using 3D convolutional neural networks extraction video clip time domain and spatial feature, by feature extraction from Still image is extended to image sequence, is adaptively adjusted parameter by training network, can independently extract being capable of reflecting time The behavioral characteristics of information, the affective characteristics extracted can preferably characterize the variation of facial expression and body posture, relative to Traditional artificial design features have stronger characterization ability and generalization ability, to finally promote the accuracy of Classification and Identification.

(2) present invention carries out emotional semantic classification identification using the information of fusion facial expression and body posture both modalities which, gram The limitation of single mode emotional semantic classification identification is taken.

(3) present invention utilizes table when decision-making level is weighted fusion to the recognition result of expression and posture both modalities which The priori of the emotional semantic classification identity confusion matrix of feelings and posture both modalities which determines the weighted value of weighting, can overcome and adopt Lead to fusion results completely because of the minor change of Basic probability assignment function when being merged with D-S evidence theory decision rule Different unstability, and the problems such as running counter to the result of convention is generated in processing conflict completely or height conflicting evidence, The accuracy and robustness of emotion recognition can be effectively improved.

Description of the drawings

Fig. 1 is a kind of flow chart of bimodal emotion recognition method based on 3D convolutional neural networks of the present invention；

Fig. 2 is a kind of basic framework figure of bimodal emotion recognition method based on 3D convolutional neural networks of the present invention；

Fig. 3 is the partial video truncated picture in FABO databases；(a)-(c) is different facial expression video sectional drawings, (d)-(f) is different body posture video interceptions.

Specific implementation mode

Specific embodiments of the present invention are further described in detail with reference to the accompanying drawings of the specification.

As shown in Figure 1, a kind of bimodal emotion recognition side based on 3D convolutional neural networks provided in an embodiment of the present invention Method mainly includes the following steps：

Step 1：Everyone facial expression video clip and body posture video clip sample is obtained simultaneously, it will be each A video clip is trimmed into an isometric frame sequence, establishes expression and posture bimodal emotion comprising emotional category label and regards Frequency library, and the sample of bimodal emotion video library is divided into training set, verification collection and test set according to a certain percentage.

In the present embodiment, FABO (A Bimodal Face and Body Gesture Database) bimodal feelings are chosen Feel video database.In practice, other video databases can also be used, or twin camera is voluntarily used to acquire facial table Feelings video and body posture video establish the expression and posture bimodal emotion video library for including emotional category label.This implementation The sample that the FABO databases of example provide contains 23 people, everyone has 9 kinds of different emotional categories, including anger, anxiety, It is weary of, detests, fearing, is sad, is surprised, is glad, is uncertain." sad " and " surprised " the two kinds of feelings for including in view of FABO databases The sample number for feeling classification is insufficient, we have chosen anger, anxiety, are weary of, detest, fearing, is glad, not knowing 7 kinds of emotional categories Sample, used respectively 1~7 as emotional category label；Video sample in database is pre-processed, according to 4:1:1 The arbitrary selecting video sample of ratio is respectively as training set, verification collection and test set, and each video clip interception is at 16 frames The video sequence of each sample set and label are stored as lst files by long frame sequence.When practical application, frame length can 16, 24, it is chosen in 32 numerical value.

Step 2：Two kinds of 3D convolutional neural networks are built respectively, wherein the first 3D convolutional neural networks are used for facial expression Emotion recognition, the 2nd 3D convolutional neural networks are used for body posture emotion recognition.

Structure the first 3D convolutional neural networks, including be linked in sequence 1 data input layer, at least two convolutional layer and The composite module of pond layer, 1 full articulamentum and 1 Softmax classification layer；

Data input layer is first layer, is inputted as expression video sequence, to every frame image progress normalizing in video sequence Change is handled；

The composite module of convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU Nonlinear activation function layer selects m₁A d₁×k₁×k₁3D convolution kernels convolution algorithm is carried out to the output of last layer, wherein m₁、d₁、k₁For positive integer, d₁、k₁It is chosen in 3,5,7 numerical value, m₁It is chosen in 32,64,128,256,512 numerical value；Pond layer Select d₂×k₂×k₂Pondization verification last layer convolutional layer output carry out down-sampling operation, wherein d₂、k₂For positive integer, d₂、 k₂It is chosen in 1,2,3 numerical value；

The output of last layer pond layer is fully connected to the c output neuron of this layer by full articulamentum, exports what a c was tieed up Feature vector, wherein c is positive integer, is chosen in 256,512,1024 numerical value；

The feature vector of the full articulamentum output of last layer is connected to n output node by Softmax classification layers entirely, is passed through Softmax obtains a n-dimensional vector after returning, and the numerical value of wherein each dimension is exactly the emotional category category of input video sequence In the probability of corresponding classification.

Structure the 2nd 3D convolutional neural networks, including be linked in sequence 1 data input layer, at least two convolutional layer and The composite module of pond layer, 1 full articulamentum and 1 Softmax classification layer；

Data input layer is first layer, is inputted as posture video sequence, to every frame image progress normalizing in video sequence Change is handled；

The composite module of convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU Nonlinear activation function layer selects m₂A d₃×k₃×k₃3D convolution kernels convolution algorithm is carried out to the output of last layer, wherein m₂、d₃、k₃For positive integer, d₃、k₃It is chosen in 3,5,7 numerical value, m₂It is chosen in 32,64,128,256,512 numerical value；Pond layer Select d₄×k₄×k₄Pondization verification last layer convolutional layer output carry out down-sampling operation, wherein d₄、k₄For positive integer, d₄、 k₄It is chosen in 1,2,3 numerical value；

Used database based on the present embodiment can build two kinds of structures are identical, model parameter is different 3D volumes Product neural network, as shown in Fig. 2, concrete structure is as follows：

First layer is data input layer, is 112 by each frame image normalization in the video sequence of 16 frame lengths of input × 112 pixels；

The second layer is convolutional layer 1, selects the feature of 64 3 × 3 × 3 3D convolution kernels pair the first layer data input layer output Figure group carries out convolution operation, and convolution step-length is 1, carries out zero padding (Zero Padding) and operate the length of edged to be 1, after convolution again Nonlinear Mapping is carried out by correcting linear unit (ReLU) function, exports 64 characteristic pattern groups, each characteristic pattern group includes 16 The characteristic pattern that a size is 112 × 112；

Third layer is pond layer 1, the Chi Huahe of selection 1 × 2 × 2, the characteristic pattern group that convolutional layer 1 is exported with step-length 2 into Row down-sampling operates, and exports 64 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 16 sizes are 56 × 56；

4th layer is convolutional layer 2, and the 3D convolution kernels selected 128 3 × 3 × 3 carry out the characteristic pattern group that pond layer 1 exports Convolution operation, convolution step-length are 1, and the length for carrying out zero padding operation edged is 1, using amendment linear unit (ReLU) after convolution Function carries out Nonlinear Mapping, exports 128 characteristic pattern groups, and each characteristic pattern group includes the feature that 16 sizes are 56 × 56 Figure；

Layer 5 is pond layer 2, the Chi Huahe of selection 2 × 2 × 2, the characteristic pattern group that convolutional layer 2 is exported with step-length 2 into Row down-sampling operates, and exports 128 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 8 sizes are 28 × 28；

Layer 6 is convolutional layer 3, and the 3D convolution kernels selected 256 3 × 3 × 3 carry out the characteristic pattern group that pond layer 2 exports Convolution operation, convolution step-length are 1, and the length for carrying out zero padding operation edged is 1, using amendment linear unit (ReLU) after convolution Function carries out Nonlinear Mapping, exports 256 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 8 sizes are 28 × 28；

Layer 7 is pond layer 3, the Chi Huahe of selection 2 × 2 × 2, the characteristic pattern group that convolutional layer 3 is exported with step-length 2 into Row down-sampling operates, and exports 256 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 4 sizes are 14 × 14；

8th layer is convolutional layer 4, and the 3D convolution kernels selected 256 3 × 3 × 3 carry out the characteristic pattern group that pond layer 3 exports Convolution operation, convolution step-length are 1, and the length for carrying out zero padding operation edged is 1, after convolution using correct linear unit function into Row Nonlinear Mapping, exports 256 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 4 sizes are 14 × 14；

9th layer is pond layer 4, selects the Chi Huahe of 2 × 2 × 2 sizes, the characteristic pattern exported to convolutional layer 4 with step-length 2 Group carries out down-sampling operation, exports 256 characteristic pattern groups, and each characteristic pattern group includes the characteristic pattern that 2 sizes are 7 × 7；

Tenth layer is full articulamentum, and the output of pond layer 4 is fully connected to 512 output neurons of this layer, output one Then the feature vector of a 512 dimension uses Dropout methods to adjust connection weight by it using ReLU function nonlinear transformations Weight, full linking number are 512；

Eleventh floor is classification layer, and using Softmax graders, the feature vector of the tenth layer of full articulamentum output is connected entirely 7 output nodes are connected to, 7 dimensional vectors are obtained after Softmax is returned, the numerical value of wherein each dimension is exactly to input The emotional category of video sequence belongs to the probability of corresponding classification；

After building above two 3D convolutional neural networks, with the expression video sequence and appearance in bimodal emotion video library State video sequence is respectively trained corresponding 3D convolutional neural networks as input, optimizes two using back-propagation algorithm The model parameter of kind 3D convolutional neural networks.

Step 3：The first 3D convolutional neural networks of expression video sequence pair concentrated using training set and verification are trained, The 2nd 3D convolutional neural networks of posture video sequence pair concentrated using training set and verification are trained, optimization network model ginseng Number.Wherein, training set is used for network training, after often training iteration preset times, is once tested on verification collection, for testing Whether the selection for demonstrate,proving network parameter is reasonable.

Step 4：The expression video sequence samples in test set are carried out using the first 3D convolutional neural networks after optimization Emotional semantic classification identifies, obtains 7 dimensional vectors, the maximum dimension of the numerical values recited of each more vectorial dimension, wherein numerical value The corresponding classification of degree is exactly the emotional category of the sample；All expression video sequence samples in test set are carried out to repeat survey Examination, statistical classification recognition result obtain expression emotional semantic classification identity confusion matrix E, i.e.,

Similarly, using the 2nd 3D convolutional neural networks after optimization to the posture video sequence sample in test set into market Feel Classification and Identification, obtains 7 dimensional vectors, the maximum dimension of the numerical values recited of each more vectorial dimension, wherein numerical value Corresponding classification is exactly the emotional category of the sample；All posture video sequence samples in test set are carried out to repeat survey Examination, statistical classification recognition result obtain posture emotional semantic classification identity confusion matrix G, i.e.,

Step 5：Using after optimization the first 3D convolutional neural networks and the 2nd 3D convolutional neural networks respectively to newly inputting Expression video sequence and posture video sequence carry out emotional semantic classification identification, obtain the emotional semantic classification of expression and posture both modalities which Recognition result；

Step 6：The expression emotional semantic classification identity confusion matrix E and posture emotional semantic classification identity confusion obtained using step 4 The emotional semantic classification recognition result for the both modalities which that step 5 obtains is weighted fusion by the priori of matrix G in decision-making level, Obtain the emotional semantic classification of bimodal as a result, specific steps are as follows：

(6.1) numerical value of the element on expression emotional semantic classification identity confusion matrix E leading diagonals is normalized, is obtained

(6.2) numerical value of the element on posture emotional semantic classification identity confusion matrix G leading diagonals is normalized, is obtained

(6.3) the emotional semantic classification recognition result of expression and posture both modalities which is weighted fusion, obtains one new 7 Dimensional vector V, i.e.,

A kind of bimodal emotion recognition method based on 3D convolutional neural networks that the embodiment of the present invention proposes and traditional Bimodal emotion recognition method is compared, the affective characteristics extracted relative to artificial design features have stronger characterization ability and Generalization ability, to finally promote the accuracy of Classification and Identification.In addition, the identification in decision-making level to expression and posture both modalities which When being as a result weighted fusion, using the priori of the emotional semantic classification identity confusion matrix of expression and posture both modalities which come really Surely the weighted value weighted can overcome when being merged using D-S evidence theory decision rule because of Basic probability assignment function Minor change and the unstability for causing fusion results completely different, and produced in processing conflict completely or height conflicting evidence Raw the problems such as running counter to the result of convention, the accuracy and robustness of emotion recognition can be effectively improved.

A kind of bimodal emotion recognition system based on 3D convolutional neural networks that another embodiment of the present invention provides, packet It includes：Preprocessing module will for obtaining everyone facial expression video clip and body posture video clip sample simultaneously Each video clip is trimmed into an isometric frame sequence, establishes the expression and posture bimodal feelings for including emotional category label Feel video library, and the sample of bimodal emotion video library is divided into training set, verification collection and test set；Network model trains mould Block, the expression video sequence concentrated using training set and verification and posture video sequence are respectively to the first 3D convolutional Neurals of structure Network and the 2nd 3D convolutional neural networks are trained, and optimize network model parameter；The first 3D convolutional neural networks and Two 3D convolutional neural networks include：Data input layer is used for input video sequence, and normalizing is carried out to the image in video sequence Change is handled；The composite module of at least two convolutional layer and pond layer, wherein convolutional layer are using several 3D convolution kernels to the defeated of last layer Go out and carry out convolution algorithm, pond layer is used for the output to convolutional layer and carries out down-sampling operation；Full articulamentum is used for last layer pond The output for changing layer is fully connected to the output neuron of this layer, exports a feature vector；And classification layer, for that will connect entirely The feature vector of layer output is connected to the output node for indicating emotional category entirely；Confusion matrix acquisition module, for being utilized respectively The first 3D convolutional neural networks and the 2nd 3D convolutional neural networks after optimization in test set expression video sequence samples and Posture video sequence sample carries out emotional semantic classification identification, and statistical classification recognition result, and obtained expression emotional semantic classification identification is mixed Confuse matrix and posture emotional semantic classification identity confusion matrix；Expression and posture emotional semantic classification identification module, utilize first after optimization 3D convolutional neural networks and the 2nd 3D convolutional neural networks respectively to the expression video sequence that newly inputs and posture video sequence into Market sense Classification and Identification obtains the emotional semantic classification recognition result of expression and posture both modalities which；And decision-making module, for profit The expression emotional semantic classification identity confusion matrix that is obtained with confusion matrix acquisition module and posture emotional semantic classification identity confusion matrix Priori, the emotional semantic classification recognition result for the both modalities which that expression and posture emotional semantic classification identification module are obtained is in decision-making level It is weighted fusion, obtains the emotional semantic classification result of bimodal.

The above-mentioned bimodal emotion recognition system embodiment based on 3D convolutional neural networks can be used for executing above-mentioned be based on The bimodal emotion recognition method embodiments of 3D convolutional neural networks, technical principle, it is solved the technical issues of and generation Technique effect is similar, the specific work process of the bimodal emotion recognition based on 3D convolutional neural networks of foregoing description and related Illustrate, the corresponding process in the aforementioned bimodal emotion recognition method embodiment based on 3D convolutional neural networks can be referred to, This is repeated no more.

It will be understood by those skilled in the art that can carry out adaptively changing and it to the module in embodiment Be arranged in the one or more systems different from the embodiment.Can in embodiment module or unit or component combine At a module or unit or component, and it can be divided into multiple submodule or subelement or sub-component in addition.

The above, the only specific implementation mode in the present invention, but scope of protection of the present invention is not limited thereto, appoints What is familiar with the people of the technology within the technical scope disclosed by the invention, it will be appreciated that expects transforms or replaces, and should all cover Within the scope of the present invention, therefore, the scope of protection of the invention shall be subject to the scope of protection specified in the patent claim.

Claims

1. a kind of bimodal emotion recognition method based on 3D convolutional neural networks, which is characterized in that include the following steps：

(1) while everyone facial expression video clip and body posture video clip sample is obtained, by each video Segment is trimmed into an isometric frame sequence, establishes the expression and posture bimodal emotion video library for including emotional category label, And the sample of bimodal emotion video library is divided into training set, verification collection and test set；

(2) utilize the expression video sequence and posture video sequence that training set and verification are concentrated respectively to the first 3D convolution of structure Neural network and the 2nd 3D convolutional neural networks are trained, and optimize network model parameter；The first 3D convolutional neural networks Include with the 2nd 3D convolutional neural networks：

The composite module of at least two convolutional layer and pond layer, wherein convolutional layer use output of several 3D convolution kernels to last layer Convolution algorithm is carried out, pond layer is used for the output to convolutional layer and carries out down-sampling operation；

Full articulamentum, the output neuron for the output of last layer pond layer to be fully connected to this layer, output one feature to Amount；

And classification layer, the feature vector for exporting full articulamentum is connected to the output node for indicating emotional category entirely, defeated Go out a n-dimensional vector, wherein n is emotional category number；

(3) the first 3D convolutional neural networks and the 2nd 3D convolutional neural networks being utilized respectively after optimization are to the expression in test set Video sequence sample and posture video sequence sample carry out emotional semantic classification identification, and statistical classification recognition result, obtain n × n's Expression emotional semantic classification identity confusion matrix E and posture emotional semantic classification identity confusion matrix G；

(4) using after optimization the first 3D convolutional neural networks and the 2nd 3D convolutional neural networks the expression newly inputted is regarded respectively Frequency sequence and posture video sequence carry out emotional semantic classification identification, obtain the emotional semantic classification identification knot of expression and posture both modalities which Fruit；

(5) the expression emotional semantic classification identity confusion matrix E and posture emotional semantic classification identity confusion matrix G that step (3) obtains are utilized Priori, the emotional semantic classification recognition result for the both modalities which that step (4) obtains is weighted fusion in decision-making level, is obtained The emotional semantic classification result of bimodal.

2. a kind of bimodal emotion recognition method based on 3D convolutional neural networks according to claim 1, feature exist In, the first 3D convolutional neural networks, including the 1 data input layer, at least two convolutional layer and the pond layer that are linked in sequence Composite module, 1 full articulamentum and 1 Softmax classification layer；

The data input layer is first layer, is inputted as expression video sequence, to every frame image progress normalizing in video sequence Change is handled；The length of the expression video sequence is 16,24 or 32 frames；

The composite module of the convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU Nonlinear activation function layer selects m₁A d₁×k₁×k₁3D convolution kernels convolution algorithm is carried out to the output of last layer, wherein d₁、k₁It is chosen in 3,5,7 numerical value, m₁It is chosen in 32,64,128,256,512 numerical value；Pond layer choosing d₂×k₂×k₂Pond The output for changing verification last layer convolutional layer carries out down-sampling operation, wherein d₂、k₂It is chosen in 1,2,3 numerical value；

The output of last layer pond layer is fully connected to the c output neuron of this layer by the full articulamentum, exports what a c was tieed up Feature vector, wherein c chooses in 256,512,1024 numerical value；

The feature vector of the full articulamentum output of last layer is connected to n output node by the Softmax classification layers entirely, is passed through Softmax obtains a n-dimensional vector [p after returning₁ p₂ p₃ … p_n]^T, the numerical value of wherein each dimension is exactly to input expression The emotional category of video sequence belongs to the probability of corresponding classification；N is emotional category number.

3. a kind of bimodal emotion recognition method based on 3D convolutional neural networks according to claim 1, feature exist In, the 2nd 3D convolutional neural networks, including the 1 data input layer, at least two convolutional layer and the pond layer that are linked in sequence Composite module, 1 full articulamentum and 1 Softmax classification layer；

The data input layer is first layer, is inputted as posture video sequence, to every frame image progress normalizing in video sequence Change is handled；The length of the posture video sequence is 16,24 or 32 frames；

The composite module of the convolutional layer and pond layer, including 1 convolutional layer and 1 pond layer, wherein convolutional layer includes ReLU Nonlinear activation function layer selects m₂A d₃×k₃×k₃3D convolution kernels convolution algorithm is carried out to the output of last layer, wherein d₃、k₃It is chosen in 3,5,7 numerical value, m₂It is chosen in 32,64,128,256,512 numerical value；Pond layer choosing d₄×k₄×k₄Pond The output for changing verification last layer convolutional layer carries out down-sampling operation, wherein d₄、k₄It is chosen in 1,2,3 numerical value；

The feature vector of the full articulamentum output of last layer is connected to n output node by the Softmax classification layers entirely, is passed through Softmax obtains a n-dimensional vector [q after returning₁ q₂ q₃ … q_n]^T, the numerical value of wherein each dimension is exactly to input posture The emotional category of video sequence belongs to the probability of corresponding classification；N is emotional category number.

4. a kind of bimodal emotion recognition method based on 3D convolutional neural networks according to claim 1, feature exist In the step (5) includes：

(5.3) the emotional semantic classification recognition result of expression and posture both modalities which is weighted fusion, obtain a new n tie up to V is measured, i.e.,

Compare the numerical values recited of each dimension in vectorial V, the wherein classification corresponding to the maximum dimension of numerical value is exactly to input to regard The emotional category of frequency sequence；Wherein [p₁ p₂ p₃ … p_n]^T[q₁ q₂ q₃ … q_n]^TRespectively the first 3D convolutional Neural nets The recognition result vector of network and the classification layer output of the 2nd 3D convolutional neural networks.

5. a kind of bimodal emotion recognition system based on 3D convolutional neural networks, which is characterized in that including：

Preprocessing module, for obtaining everyone facial expression video clip and body posture video clip sample simultaneously, Each video clip is trimmed into an isometric frame sequence, establishes the expression and posture bimodal for including emotional category label Emotion video library, and the sample of bimodal emotion video library is divided into training set, verification collection and test set；

Network model training module, the expression video sequence concentrated using training set and verification and posture video sequence are respectively to structure The first 3D convolutional neural networks built and the 2nd 3D convolutional neural networks are trained, and optimize network model parameter；Described first 3D convolutional neural networks and the 2nd 3D convolutional neural networks include：Data input layer is used for input video sequence, to video sequence Image in row is normalized；The composite module of at least two convolutional layer and pond layer, wherein convolutional layer use several 3D Convolution kernel carries out convolution algorithm to the output of last layer, and pond layer is used for the output to convolutional layer and carries out down-sampling operation；Quan Lian Layer is connect, the output neuron for the output of last layer pond layer to be fully connected to this layer exports a feature vector；And Classification layer, the feature vector for exporting full articulamentum are connected to the output node for indicating emotional category entirely, export a n dimension Vector, wherein n are emotional category number；

Confusion matrix acquisition module, the first 3D convolutional neural networks and the 2nd 3D convolutional Neural nets for being utilized respectively after optimizing Network in test set expression video sequence samples and posture video sequence sample carry out emotional semantic classification identification, and statistical classification know Not as a result, obtaining the expression emotional semantic classification identity confusion matrix and posture emotional semantic classification identity confusion matrix of n × n；

Expression and posture emotional semantic classification identification module utilize the first 3D convolutional neural networks and the 2nd 3D convolutional Neurals after optimization Network carries out emotional semantic classification identification to the expression video sequence and posture video sequence that newly input respectively, obtains expression and posture two The emotional semantic classification recognition result of kind mode；

And decision-making module, the expression emotional semantic classification identity confusion matrix for being obtained using confusion matrix acquisition module and appearance The priori of state emotional semantic classification identity confusion matrix, the both modalities which that expression and posture emotional semantic classification identification module are obtained Emotional semantic classification recognition result is weighted fusion in decision-making level, obtains the emotional semantic classification result of bimodal.