CN107808146B - Multi-mode emotion recognition and classification method

Info

Publication number
CN107808146B
Authority
CN
China
Prior art keywords
space
time
image
face
probability
Prior art date
Legal status
Active
Application number
CN201711144196.1A
Other languages
Chinese (zh)
Other versions
CN107808146A (en)
Inventor
孙波
何珺
余乐军
曹斯铭
Current Assignee
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN201711144196.1A
Publication of CN107808146A
Application granted
Publication of CN107808146B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G06V40/172 - Classification, e.g. identification
    • G06V40/174 - Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal emotion recognition and classification method. A video containing the face to be detected and a video containing body actions recorded over the same time period are processed and converted into image time sequences composed of image frames; temporal and spatial features are extracted from these sequences; the resulting multi-layer deep spatio-temporal features are fused at the feature level; and the classification results are fused at the decision level, so that the emotion category of the person in the video to be detected is recognized from multiple modalities.

Description

Multi-mode emotion recognition and classification method
Technical Field
The invention relates to the technical field of computer processing, in particular to a multi-mode emotion recognition and classification method.
Background
Emotion recognition is an emerging research field that spans multiple disciplines, including computer science, cognitive science, psychology, brain science and neuroscience. Its research goal is to enable computers to learn and understand human emotional expression and, ultimately, to give computers the ability to recognize and understand emotion as humans do. As a highly challenging interdisciplinary subject, emotion recognition has therefore become a research hotspot in pattern recognition, computer vision, big-data mining and artificial intelligence both at home and abroad, and it has important research value and broad application prospects.
In existing emotion recognition research, two clear trends can be observed: on the one hand, the data have expanded from static images to dynamic image sequences; on the other hand, the research has extended from single-modal to multi-modal emotion recognition. Emotion recognition based on static images has already achieved many good results, but it ignores the temporal dynamics of human expressions, and the overall accuracy of analysis on video data still requires further research compared with picture-based emotion recognition. In addition, psychological research shows that emotion recognition is inherently a multi-modal problem: judging emotional states jointly from body posture and facial expression gives better results than using single-modal information, and recognition based on multi-modal information fusion is more accurate and reliable than recognition from a single modality. Multi-modal information fusion has therefore also become a research hotspot in the field of emotion recognition.
In the prior art, the fusion of the facial-expression and body-posture modalities adopts only a single fusion mode, selecting either feature-level fusion or decision-level fusion according to some strategy. Existing methods cannot extract effective spatio-temporal features from video data for emotion recognition; moreover, whether early (feature-level) fusion or late (decision-level) fusion is adopted, such fusion methods are model-independent, do not make full use of the effective information present in each modality, and generally suffer from low fusion efficiency.
Disclosure of Invention
In order to solve the problems in the prior art that effective spatio-temporal features cannot be extracted from video data for emotion recognition, and that both early fusion and late fusion are model-independent, fail to make full use of the effective information in each modality and generally have low fusion efficiency, a multi-modal emotion recognition and classification method is provided.
According to one aspect of the invention, the multi-modal emotion recognition classification method comprises the following steps:
s1, receiving data to be detected, wherein the data to be detected comprises a video containing a face and a corresponding video containing a body motion at the same time, and preprocessing the video containing the face and the corresponding video containing the body motion to obtain a face image time sequence containing the face and a body image time sequence containing the body motion;
s2, sequentially inputting the face image time sequence into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM, taking out output data as a first face image space-time characteristic, sequentially inputting the body image time sequence into the convolution neural network based on Alexnet and the circulation neural network based on BLSTM, and taking out the output data as a first body image space-time characteristic;
s3, serially inputting the first face image space-time feature and the first body image space-time feature into a fully-connected neural network, obtaining a probability matrix belonging to different emotion types after the first face image space-time feature and the first body image space-time feature are fused, marking the probability matrix as a first probability matrix, simultaneously serially inputting the first face image space-time feature and the first body image space-time feature into a support vector machine, obtaining a probability matrix belonging to different emotion types after the first face image space-time feature and the first body image space-time feature are serially connected, and marking the probability matrix as a second probability matrix;
s4, inputting the first face image space-time feature into a support vector machine, obtaining probability matrixes of the first face image space-time feature belonging to different emotion types, marking the probability matrixes as third probability matrixes, inputting the first body image feature into the support vector machine, obtaining probability matrixes of the first body image space-time feature belonging to different emotion types, marking the probability matrixes as fourth probability matrixes, performing decision fusion on the first probability matrixes, the second probability matrixes, the third probability matrixes and the fourth probability matrixes, obtaining first fusion probability matrixes, and taking the highest probability emotion type in the first fusion probability matrixes as an emotion recognition result.
Wherein, before the step S1, the method further includes: and training the Alexnet-based convolutional neural network, the BLSTM-based cyclic neural network, the fully-connected neural network and the support vector machine.
In step S1, the preprocessing the video including the face and the corresponding video including the body motion specifically includes:
carrying out face detection and alignment processing on each frame of image in the video containing the face, and arranging the processed image frames according to a time sequence to obtain a face image time sequence;
and carrying out normalization processing on each frame image in the video containing the body movement, and arranging the processed image frames according to a time sequence to obtain a body image time sequence.
Wherein the step S1 further includes:
reading the mark of each image frame in the video containing the face, extracting the image frames marked as beginning, vertex and disappearance to form a face image time sequence;
reading the mark of each image frame in the video containing the body action, extracting the image frames marked as beginning, vertex and disappearance to form a body image time sequence;
wherein the markers of the image frame include a plateau, a start, a vertex, and a vanishing.
Wherein, the step S2 specifically includes:
s21, inputting the face image time sequence into a convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as face space initial features, carrying out principal component analysis on the face space initial features so as to realize space conversion and dimensionality reduction, obtaining first face image space features, inputting the body image time sequence into the convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as body space initial features, carrying out principal component analysis on the body space initial features so as to realize space conversion and dimensionality reduction, and obtaining first body image space features;
s22, inputting the first human face image space characteristic into a BLSTM-based recurrent neural network, taking out the data of the first two full connected layers in the three full connected layers as the human face space-time initial characteristic, carrying out principal component analysis on the human face space-time initial characteristic to realize space conversion and dimensionality reduction, obtaining the first human face image space-time characteristic, inputting the first human body image space characteristic into the BLSTM-based recurrent neural network, taking out the data of the first two full connected layers in the three full connected layers as the human body space-time initial characteristic, carrying out principal component analysis on the human body space-time initial characteristic, realizing space conversion and dimensionality reduction, and obtaining the first human body image space-time characteristic.
Wherein, the step S1 further includes:
and cutting the face image time sequence and the body image time sequence according to the preset length of the sliding window to obtain a face image time subsequence group consisting of a plurality of face image time sequence segments and a body image time subsequence group consisting of a plurality of body image time sequence segments.
Wherein the step S2 further includes:
sequentially inputting a plurality of face image time sequence segments in the face image time subsequence group into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM, and taking out output data as second face image space-time characteristics;
and sequentially inputting a plurality of body image time sequence segments in the body image time subsequence group into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM, and taking out output data as the space-time characteristics of the second body image.
Wherein, the step S2 further includes:
inputting a plurality of face image time sequences in the face image time subsequence group into a convolution neural network based on Alexnet, taking out data of the first two full connection layers in the three full connection layers as second face space initial features, performing principal component analysis on the second face space initial features to realize space conversion and dimension reduction, obtaining second face image space features, inputting a plurality of body image time sequences in the body image time subsequence group into the convolution neural network based on Alexnet, taking out data of the first two full connection layers in the three full connection layers as second body space initial features, and performing principal component analysis on the second body space initial features to realize space conversion and dimension reduction, so as to obtain second body image space features;
inputting the space characteristics of the second face image into a BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the space-time initial characteristics of the second face, performing principal component analysis on the space-time initial characteristics of the face to realize space conversion and dimensionality reduction, obtaining the space-time characteristics of the second face image, inputting the space characteristics of the second body image into the BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the space-time initial characteristics of the second body, performing principal component analysis on the space-time initial characteristics of the body to realize space conversion and dimensionality reduction, and obtaining the space-time characteristics of the second body image.
Wherein the step S3 further includes:
inputting the second face image space-time characteristic and the second body image space-time characteristic into a fully-connected neural network in series, inputting an output result into a support vector machine, obtaining probability matrixes which belong to different emotion types after the second face image space-time characteristic and the second body image space-time characteristic are fused, marking the probability matrixes as fifth probability matrixes, simultaneously inputting the second face image space-time characteristic and the second body image space-time characteristic into the support vector machine in series, obtaining probability matrixes which belong to different emotion types after the second face image space-time characteristic and the second body image space-time characteristic are fused, and marking the probability matrixes as sixth probability matrixes.
Wherein the step S4 further includes:
inputting the second face image space-time characteristics into a support vector machine to obtain probability matrixes of the second face image space-time characteristics belonging to different emotion types, marking the probability matrixes as seventh probability matrixes, inputting the second body image space-time characteristics into the support vector machine to obtain probability matrixes of the second body image space-time characteristics belonging to different emotion types, marking the probability matrixes as eighth probability matrixes, and performing decision fusion on the fifth probability matrixes, the sixth probability matrixes, the seventh probability matrixes and the eighth probability matrixes to obtain second fusion probability matrixes;
and performing decision fusion on the first fusion probability matrix and the second fusion probability matrix to obtain a third fusion probability matrix, and taking the emotion type with the highest probability in the third fusion probability matrix as an emotion recognition result.
The method provided by the invention adopts a multi-mode combined emotion recognition method, fully utilizes effective information of various modes in the video to be detected, improves the fusion efficiency, and simultaneously improves the accuracy of emotion recognition.
Drawings
Fig. 1 is a flowchart of a multi-modal emotion recognition classification method according to an embodiment of the present invention;
FIG. 2 is a comparison graph of emotion recognition rates based on time series and using different fusion strategies in the multi-modal emotion recognition classification method provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a neural network structure for extracting spatiotemporal features in a multi-modal emotion recognition classification method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating segmentation of a time sequence by using a sliding window in a multi-modal emotion recognition classification method according to an embodiment of the present invention;
FIG. 5 is a comparison graph of emotion recognition rates based on time sequence segments and using different fusion strategies in the multi-modal emotion recognition classification method provided by an embodiment of the present invention;
fig. 6 is an emotion recognition rate comparison diagram obtained by fusing a time sequence and a time sequence segment according to the multi-modal emotion recognition classification method provided by the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a multi-modal emotion recognition and classification method according to an embodiment of the present invention, where the method includes:
s1, receiving data to be detected, wherein the data to be detected comprises a video containing a human face and a corresponding video containing a body action, and preprocessing the video containing the human face and the corresponding video containing the body action to obtain a human face image time sequence and a body image time sequence.
Specifically, a video containing a person's facial expressions and a video containing body movements recorded over the same time period are received and preprocessed; the face video and the body-movement video are then each arranged frame by frame, yielding a face image time sequence and a body image time sequence composed of the image frames in the videos.
In this way, the video data are converted into sequences of image frames, which makes the data easier to manipulate and to process in subsequent steps.
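For illustration only (the patent does not prescribe any particular tooling), a minimal Python sketch of this video-to-frame-sequence conversion using OpenCV is given below; the file names and the 227x227 frame size are assumptions.

```python
import cv2  # OpenCV for video decoding


def video_to_frames(video_path, size=(227, 227)):
    """Decode a video file into a time-ordered list of RGB frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:  # end of the video
            break
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame_rgb, size))
    cap.release()
    return frames  # image time sequence: list of H x W x 3 arrays


# Hypothetical usage: the two modalities recorded over the same time span
face_sequence = video_to_frames("face_clip.avi")
body_sequence = video_to_frames("body_clip.avi")
```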
And S2, sequentially inputting the face image time sequence into a convolutional neural network based on Alexnet and a cyclic neural network based on BLSTM, taking out output data as a first face image space-time characteristic, sequentially inputting the body image time sequence into the convolutional neural network based on Alexnet and the cyclic neural network based on BLSTM, and taking out the output data as a first body image space-time characteristic.
Specifically, the face image time sequence and the body image time sequence obtained in S1 are each input into a trained Alexnet-based convolutional neural network followed by a BLSTM-based recurrent neural network. The convolutional neural network extracts the spatial features of the image time sequence, and the recurrent neural network then extracts temporal information from those spatial features, so that spatio-temporal features of the image time sequence are obtained. In this embodiment, feeding the face image time sequence and the body image time sequence through the trained networks yields the spatio-temporal features of the face image sequence, i.e. the first face image spatio-temporal features, and the spatio-temporal features of the body image sequence, i.e. the first body image spatio-temporal features.
In this way, a deep network combining an Alexnet-based convolutional neural network and a BLSTM-based recurrent neural network is constructed to extract local and global spatio-temporal features, so that the face image time sequence and the body image time sequence can be classified according to the acquired multi-layer deep spatio-temporal features.
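A minimal PyTorch sketch of this kind of CNN-plus-BLSTM structure is shown below, assuming torchvision's stock AlexNet as the spatial backbone and a bidirectional LSTM as the BLSTM stage; the hidden size, the number of emotion classes and the use of the last time step for classification are illustrative assumptions, not details taken from the patent.

```python
import torch.nn as nn
from torchvision import models


class SpatioTemporalNet(nn.Module):
    """AlexNet extracts per-frame spatial features; a bidirectional LSTM
    (BLSTM) then models the temporal dynamics across the frame sequence."""

    def __init__(self, num_classes=10, hidden=512):
        super().__init__()
        alexnet = models.alexnet(weights=None)
        # Keep everything up to (but excluding) AlexNet's final classification layer
        self.cnn = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                                 *list(alexnet.classifier)[:-1])
        self.blstm = nn.LSTM(input_size=4096, hidden_size=hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, clip):  # clip: (batch, time, 3, 224, 224)
        b, t = clip.shape[:2]
        spatial = self.cnn(clip.flatten(0, 1))  # (batch*time, 4096)
        spatial = spatial.view(b, t, -1)        # (batch, time, 4096)
        temporal, _ = self.blstm(spatial)       # (batch, time, 2*hidden)
        return self.fc(temporal[:, -1])         # per-clip class scores
```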
And S3, serially inputting the first face image space-time characteristic and the first body image space-time characteristic into a fully-connected neural network, inputting an output result into a support vector machine, obtaining probability matrixes belonging to different emotion types after the first face image space-time characteristic and the first body image space-time characteristic are fused, marking the probability matrixes as first probability matrixes, simultaneously serially inputting the first face image space-time characteristic and the first body image space-time characteristic into the support vector machine, obtaining probability matrixes belonging to different emotion types after the first face image space-time characteristic and the first body image space-time characteristic are serially connected, and marking the probability matrixes as second probability matrixes.
Specifically, the first face image spatio-temporal features and the first body image spatio-temporal features are concatenated and input into a trained fully-connected neural network, and the output is fed into a trained support vector machine. From the combination of the two modalities, the probability that the combined features belong to each emotion category is obtained, and a first classification probability matrix is constructed.
Preferably, the output of the penultimate fully-connected layer of the fully-connected neural network is taken and reduced in dimensionality by principal component analysis, and the processed data are then input into the trained support vector machine to obtain a more precise probability classification result.
On the other hand, the first face image spatio-temporal features and the first body image spatio-temporal features are concatenated and the concatenated features are input directly into a trained support vector machine, which yields the probability that the concatenated features belong to each emotion category; from these probabilities a second classification probability matrix is constructed.
When the first face image spatio-temporal features and the first body image spatio-temporal features are concatenated, the dimensionality of the concatenated features can first be reduced by principal component analysis before the reduced features are input into the trained support vector machine to obtain the probability output.
In this way, the facial features and the body-action features are fused at the feature level using different fusion strategies, including a neural-network fusion strategy and a feature-concatenation fusion strategy, and a probability matrix of the video data belonging to the different emotion categories is obtained for each strategy.
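As a sketch of the feature-concatenation fusion strategy only, the snippet below concatenates the two modalities' spatio-temporal features, reduces them with principal component analysis and trains a probability-output support vector machine with scikit-learn; the array shapes, the number of PCA components and the linear kernel are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC


def concat_fusion_probability_matrix(face_feats, body_feats, labels, n_components=256):
    """face_feats, body_feats: (samples, dims) spatio-temporal features per modality.
    Returns the probability matrix (samples x emotion categories)."""
    fused = np.concatenate([face_feats, body_feats], axis=1)  # serial concatenation
    pca = PCA(n_components=n_components).fit(fused)           # dimensionality reduction
    svm = SVC(kernel="linear", probability=True)
    svm.fit(pca.transform(fused), labels)                     # train on labelled data
    return svm.predict_proba(pca.transform(fused))            # probability matrix
```

In practice the fitted PCA and SVM would be applied to the features of the data to be detected rather than to the training features shown here.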
S4, inputting the first face image space-time feature into a support vector machine, obtaining probability matrixes of the first face image space-time feature belonging to different emotion types, marking the probability matrixes as third probability matrixes, inputting the first body image feature into the support vector machine, obtaining probability matrixes of the first body image space-time feature belonging to different emotion types, marking the probability matrixes as fourth probability matrixes, performing decision fusion on the first probability matrixes, the second probability matrixes, the third probability matrixes and the fourth probability matrixes, obtaining first fusion probability matrixes, and taking the highest probability emotion type in the first fusion probability matrixes as an emotion recognition result.
Specifically, the first face image spatio-temporal features are input on their own into the trained support vector machine to obtain the probability matrix of the first face image spatio-temporal features belonging to the different emotion categories, from which a third probability matrix is constructed; likewise, the first body image spatio-temporal features are input on their own into the trained support vector machine to obtain the probability matrix of the first body image spatio-temporal features belonging to the different emotion categories, from which a fourth probability matrix is constructed.
Referring to fig. 2, which compares the emotion recognition rates obtained with different fusion strategies based on the full time sequences, decision fusion is then performed on the four probability matrices obtained above to produce a new fused probability matrix. This matrix contains the probabilities that the data to be detected belong to each emotion category, and the emotion category with the highest probability in this set is selected as the final recognition result.
According to the method, the facial expressions of a person and the body actions in the same time period are combined, the spatio-temporal features of the data to be detected are extracted with deep neural networks, and the features are classified by the support vector machine under different fusion strategies, finally realizing multi-modal emotion recognition. The effective information in each modality is fully utilized, and the accuracy of emotion recognition is improved.
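The decision-level fusion itself can be sketched as a weighted average of the probability matrices followed by an argmax; the patent only states that the matrices are fused and the most probable category is selected, so the equal-weight averaging used here is an assumption.

```python
import numpy as np


def decision_fusion(prob_matrices, weights=None):
    """Fuse several (samples x categories) probability matrices into one and
    pick the most probable emotion category for every sample."""
    stacked = np.stack(prob_matrices)                       # (n_matrices, N, C)
    if weights is None:
        weights = np.ones(len(prob_matrices))               # equal-weight assumption
    weights = np.asarray(weights, dtype=float) / np.sum(weights)
    fused = np.tensordot(weights, stacked, axes=1)          # (N, C) fused matrix
    return fused, fused.argmax(axis=1)                      # matrix + predicted labels


# e.g. first_fused, predictions = decision_fusion([p1, p2, p3, p4])
```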
On the basis of the above embodiment, the step S1 is preceded by: and training the Alexnet-based convolutional neural network, the BLSTM-based cyclic neural network, the fully-connected neural network and the support vector machine.
Specifically, 127 videos in the FABO database are used for training a convolutional neural network based on Alexnet, a cyclic neural network based on BLSTM, a fully-connected neural network, and a support vector machine.
The feature extraction models are obtained by training the Alexnet-based convolutional neural network and the BLSTM-based recurrent neural network on image sequences containing facial and body variations and adjusting the network parameters. The resulting spatio-temporal features of the different facial activities and the spatio-temporal features of the body postures are then input into a support vector machine to obtain the emotion classification model.
On the basis of the foregoing embodiment, the preprocessing the video including the human face and the corresponding video including the body motion in step S1 specifically includes: carrying out face detection and alignment processing on each frame of image in the video containing the face, and arranging the processed image frames according to a time sequence to obtain a face image time sequence; and carrying out normalization processing on each frame of image in the video containing the body movement, and arranging the processed image frames according to a time sequence to obtain a body image time sequence.
Specifically, face detection and alignment are performed on each image frame of the video containing the face, and the processed frames are arranged in temporal order to obtain the face image time sequence. At the same time, the image frames of the video containing the body movements are normalized so that all frames have a consistent format, and the processed frames are arranged in temporal order to form the body image time sequence.
In this way, every frame in the face image time sequence and in the body image time sequence has the same format, which facilitates subsequent operations such as feature extraction.
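For illustration, the sketch below implements this preprocessing with OpenCV's bundled Haar cascade face detector; the patent does not name a particular detector or alignment procedure, so the Haar cascade, the largest-face heuristic and the fixed 227x227 output size are all assumptions.

```python
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def preprocess_face_frame(frame_bgr, size=(227, 227)):
    """Detect the face in one frame and return a cropped, resized face image."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        return None  # no face found in this frame
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])  # keep the largest face
    return cv2.resize(frame_bgr[y:y + h, x:x + w], size)


def preprocess_body_frame(frame_bgr, size=(227, 227)):
    """Normalize a body-movement frame to a fixed size and value range."""
    return cv2.resize(frame_bgr, size).astype("float32") / 255.0
```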
On the basis of the above embodiment, the step S1 further includes: reading the mark of each image frame in the video containing the face, extracting the image frames marked as beginning, vertex and disappearance to form a face image time sequence; and reading the mark of each image frame in the video containing the body motion, and extracting the image frames marked as beginning, vertex and disappearance to form a body image time sequence. Wherein the markers of the image frame include a plateau, a start, a vertex, and a vanishing.
Specifically, in the database of data to be detected, every frame of each video is labelled: all image frames in the initial stage of an expressive action are marked as "start", the frames in the period when the expressive action reaches its maximum are marked as "vertex", the frames in the period when the expressive action fades away are marked as "disappear", and the remaining expressionless frames are marked as "plateau".
When performing emotion recognition with the face image time sequence and the body image time sequence, either the time sequence composed of all image frames can be used, or a sequence composed only of the frames in the period when the expressive action reaches its maximum. Preferably, the frames before the expressive action starts and after it has finished are discarded, and only the frames from the start of the expressive action to its disappearance, i.e. the frames marked "start", "vertex" and "disappear", are extracted to form the time sequence used for classification; this improves the overall recognition accuracy. Table 1 shows the emotion recognition results obtained from the face videos with different frame extraction methods, and Table 2 shows the corresponding results obtained from the body actions.
TABLE 1
Time series screening method MAA(%) ACC(%)
Vertex sequence 55.90 56.84
Start-vertex-vanish sequence 57.56 61.11
All sequences of the whole cycle 51.67 53.85
TABLE 2
Time series screening method MAA(%) ACC(%)
Vertex sequence 45.88 50.60
Start-vertex-vanish sequence 48.98 51.70
All sequences of the whole cycle 44.50 49.77
As can be seen from tables 1 and 2, the emotion recognition performed when the image frames marked as "start", "vertex" and "disappear" in the video are selected to form the time sequence has a higher recognition rate than other schemes. Wherein MAA represents the macro average accuracy, ACC represents the overall accuracy, and the calculation formula specifically comprises:
MAA = (1/s) * Σ_{i=1..s} Pi
Pi = TPi / (TPi + FPi)
ACC = Σ_{i=1..s} TPi / Σ_{i=1..s} (TPi + FPi)
wherein s is the number of emotion categories, Pi is the accuracy of the i-th emotion category, TPi is the number of correct classifications in class i, and FPi is the number of misclassifications in class i.
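The two metrics can be computed directly from the true and predicted labels; the sketch below is a straightforward implementation of the formulas above and is not code from the patent.

```python
import numpy as np


def maa_acc(y_true, y_pred, num_classes):
    """Macro average accuracy (MAA) and overall accuracy (ACC)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = []
    for i in range(num_classes):
        predicted_i = (y_pred == i)
        tp = np.sum(predicted_i & (y_true == i))  # correct classifications in class i
        fp = np.sum(predicted_i & (y_true != i))  # misclassifications in class i
        per_class.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)  # Pi
    maa = float(np.mean(per_class))               # macro average accuracy
    acc = float(np.mean(y_true == y_pred))        # overall accuracy
    return maa, acc
```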
On the basis of the foregoing embodiment, the step S2 specifically includes:
s21, inputting the face image time sequence into a convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as face space initial features, carrying out principal component analysis on the face space initial features so as to realize space conversion and dimensionality reduction, obtaining first face image space features, inputting the body image time sequence into the convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as body space initial features, carrying out principal component analysis on the body space initial features so as to realize space conversion and dimensionality reduction, and obtaining first body image space features;
s22, inputting the first human face image space characteristic into a BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the human face space-time initial characteristic, carrying out principal component analysis on the human face space-time initial characteristic, realizing space conversion and dimensionality reduction, obtaining the first human face image space-time characteristic, inputting the first human body image space characteristic into the BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the human body space-time initial characteristic, carrying out principal component analysis on the human body space-time initial characteristic, realizing space conversion and dimensionality reduction, and obtaining the first human body image space-time characteristic.
Specifically, referring to fig. 3, obtaining the multi-layer deep spatio-temporal features in the face image time sequence and the body image time sequence requires a convolutional neural network to extract features in image space and a recurrent neural network to further extract the temporal information in the image sequence. In this embodiment, the spatial features of the face image time sequence and of the body image time sequence are extracted with an Alexnet-based convolutional neural network. Preferably, the last three layers of this network are fully-connected layers whose output feature dimensions are 1024, 512 and 10, respectively; the output data of the first two of these three fully-connected layers are taken as the initial spatial features, giving 1536 dimensions in total. Principal component analysis is performed on the 1536-dimensional features to realize spatial transformation and dimensionality reduction, so that the dimensionality meets the input requirement of the BLSTM-based recurrent neural network. In the recurrent neural network, the output data of the first two of its last three fully-connected layers are likewise taken as the initial spatio-temporal features, which are also 1536-dimensional; principal component analysis is performed on these 1536-dimensional features to realize spatial transformation and dimensionality reduction, finally yielding the spatio-temporal features. In this step, the face image time sequence is passed successively through the trained Alexnet-based convolutional neural network and the trained BLSTM-based recurrent neural network to obtain the face image spatio-temporal features, and the body image time sequence is processed in the same way to obtain the body image spatio-temporal features; these are labelled the first face image spatio-temporal features and the first body image spatio-temporal features.
In this way, both the spatial features and the temporal features of the image time sequences are extracted.
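To show how the intermediate fully-connected outputs might be collected in practice, the sketch below registers forward hooks on the first two fully-connected layers of a torchvision AlexNet and reduces the concatenated activations with principal component analysis; note that the stock torchvision layer sizes (4096 and 4096) differ from the 1024- and 512-dimensional layers described in the patent, so the dimensions here are purely illustrative.

```python
import torch
from torchvision import models
from sklearn.decomposition import PCA

alexnet = models.alexnet(weights=None)   # stock fully-connected sizes: 4096, 4096, 1000
captured = {}


def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook


# Hook the first two of the three fully-connected layers in AlexNet's classifier
alexnet.classifier[1].register_forward_hook(save_output("fc1"))
alexnet.classifier[4].register_forward_hook(save_output("fc2"))

with torch.no_grad():
    alexnet(torch.randn(8, 3, 224, 224))  # 8 dummy frames stand in for a sequence

initial_spatial = torch.cat([captured["fc1"], captured["fc2"]], dim=1).numpy()
reduced = PCA(n_components=4).fit_transform(initial_spatial)  # spatial transform + reduction
```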
On the basis of the foregoing embodiments, the step S1 further includes: and cutting the face image time sequence and the body image time sequence according to the preset length of the sliding window to obtain a face image time subsequence group consisting of a plurality of face image time sequence segments and a body image time subsequence group consisting of a plurality of body image time sequence segments.
Specifically, after the face image time sequence and the body image time sequence have been obtained, each sequence is cut with a sliding window of preset length. As shown in fig. 4, a face image time sequence of length 15 that contains 5 frames marked "start", 5 frames marked "vertex" and 5 frames marked "disappear" is cut with a sliding window of length 6 and a sliding step of 1, which yields 10 face image time sequence segments of length 6; these segments form the face image time subsequence group. The sliding window length is chosen, as far as possible, so that every segment contains frames of at least two of the three types "start", "vertex" and "disappear". The body image time sequence is cut in the same way, and the body image time sequence segments obtained after cutting form the body image time subsequence group.
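A sketch of the sliding-window segmentation itself, matching the numbers in the example above (a length-15 sequence, window length 6, step 1, giving 10 segments); the helper function is an illustrative assumption rather than code from the patent.

```python
def sliding_window_segments(frames, window=6, stride=1):
    """Cut an image time sequence into overlapping fixed-length segments."""
    return [frames[start:start + window]
            for start in range(0, len(frames) - window + 1, stride)]


# A length-15 sequence with window length 6 and step 1 yields 15 - 6 + 1 = 10 segments.
segments = sliding_window_segments(list(range(15)))
assert len(segments) == 10
```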
Table 3 shows the emotion recognition results based on the face image time series at different sliding window lengths, and table 4 shows the emotion recognition results based on the body image time series at different sliding window lengths.
TABLE 3
Sliding window length t 6 7 8 9 10
MAA(%) 58.61 60.45 67.09 58.48 56.13
ACC(%) 59.00 61.25 66.46 59.03 57.21
TABLE 4
Sliding window length t 6 7 8 9 10
MAA(%) 43.66 55.00 50.20 47.33 45.81
ACC(%) 44.85 55.98 51.83 48.76 46.00
As can be seen from tables 3 and 4, when a suitable sliding window length is chosen, the recognition accuracy is higher than that of the schemes in tables 1 and 2, which use the whole time sequence without segmentation.
On the basis of the foregoing embodiments, the step S2 further includes: sequentially inputting a plurality of face image time sequence segments in the face image time subsequence group into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM to obtain a second face image space-time characteristic; and sequentially inputting a plurality of body image time sequence segments in the body image time subsequence group into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM to obtain the space-time characteristics of a second body image.
Specifically, the face image time sequence segments in the face image time subsequence group and the body image time sequence segments in the body image time subsequence group are input into the trained Alexnet-based convolutional neural network and the trained BLSTM-based recurrent neural network, which yields the spatio-temporal features of all the segments in the face image time subsequence group and of all the segments in the body image time subsequence group; these are labelled the second face image spatio-temporal features and the second body image spatio-temporal features.
In this way, features are extracted from the segmented time sequence segments, producing new face image spatio-temporal features and body image spatio-temporal features that the classifiers then use for classification.
On the basis of the foregoing embodiments, the step S2 further includes:
inputting a plurality of face image time sequences in the face image time subsequence group into a convolution neural network based on Alexnet, taking out data of the first two full connection layers in the three full connection layers as second face space initial features, performing principal component analysis on the second face space initial features to realize space conversion and dimension reduction, obtaining second face image space features, inputting a plurality of body image time sequences in the body image time subsequence group into the convolution neural network based on Alexnet, taking out data of the first two full connection layers in the three full connection layers as second body space initial features, and performing principal component analysis on the second body space initial features to realize space conversion and dimension reduction, so as to obtain second body image space features;
inputting the space characteristics of the second face image into a BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the space-time initial characteristics of the second face, performing principal component analysis on the space-time initial characteristics of the face to realize space conversion and dimensionality reduction, obtaining the space-time characteristics of the second face image, inputting the space characteristics of the second body image into the BLSTM-based recurrent neural network, taking out the data of the first two full connection layers in the three full connection layers as the space-time initial characteristics of the second body, performing principal component analysis on the space-time initial characteristics of the body to realize space conversion and dimensionality reduction, and obtaining the space-time characteristics of the second body image.
Specifically, consistent with the method for extracting the first face spatiotemporal feature and the first body spatiotemporal feature in the foregoing embodiment, in order to obtain spatiotemporal features of multiple depths in a face image time sequence and a body image time sequence, feature extraction in an image space needs to be realized by means of a convolutional neural network, and then time information in an image needs to be further extracted by using a cyclic neural network. Here, the extraction manner of the features in the neural network is the same as that in the above embodiment, and the details are not described here.
On the basis of the foregoing embodiments, the step S3 further includes: inputting the second face image space-time characteristic and the second body image space-time characteristic into a fully-connected neural network in series, inputting an output result into a support vector machine, obtaining probability matrixes which belong to different emotion types after the second face image space-time characteristic and the second body image space-time characteristic are fused, marking the probability matrixes as fifth probability matrixes, simultaneously inputting the second face image space-time characteristic and the second body image space-time characteristic into the support vector machine in series, obtaining probability matrixes which belong to different emotion types after the second face image space-time characteristic and the second body image space-time characteristic are connected in series, and marking the probability matrixes as sixth probability matrixes.
Specifically, the second face image space-time feature and the second body image space-time feature are connected in series and input into a trained fully-connected neural network, data of a last but one fully-connected layer in the fully-connected neural network is used as output data, after principal component analysis is carried out, the output data is input into a trained support vector machine, so that the probability that the second face image space-time feature and the second body image space-time feature belong to different emotion categories is obtained according to two mode combinations of the second face image space-time feature and the second body image space-time feature, and a fifth classification probability matrix is constructed.
On the other hand, the second face image space-time feature and the second body image space-time feature are connected in series, and then the features after being connected in series are input into a trained support vector machine, so that the probabilities that the second face image space-time feature and the second body image space-time feature after being connected in series belong to different emotion categories can be obtained, and the probabilities are combined to construct a sixth classification probability matrix.
On the basis of the above embodiment, the step S4 further includes: inputting the second face image space-time characteristics into a support vector machine to obtain probability matrixes of the second face image space-time characteristics belonging to different emotion types, marking the probability matrixes as seventh probability matrixes, inputting the second body image space-time characteristics into the support vector machine to obtain probability matrixes of the second body image space-time characteristics belonging to different emotion types, marking the probability matrixes as eighth probability matrixes, and performing decision fusion on the fifth probability matrixes, the sixth probability matrixes, the seventh probability matrixes and the eighth probability matrixes to obtain second fusion probability matrixes; and performing decision fusion on the first fusion probability matrix and the second fusion probability matrix to obtain a third fusion probability matrix, and taking the emotion type with the highest probability in the third fusion probability matrix as an emotion recognition result.
Specifically, the second face image space-time feature is separately input into a trained support vector machine, so that a probability matrix that the second face image space-time feature belongs to different emotion classes can be obtained, and the probability matrix is marked as a seventh probability matrix, on the other hand, the second body image space-time feature is separately input into the trained support vector machine, so that a probability matrix that the second body image space-time feature belongs to different emotion classes can be obtained, and the probability matrix is marked as an eighth probability matrix.
Referring to fig. 5, fig. 5 compares the emotion recognition rates obtained with the fifth probability matrix, the sixth probability matrix, the seventh probability matrix and the eighth probability matrix individually; performing decision fusion on these four probability matrices produces the second fusion probability matrix, whose recognition rate is shown as Multi4-2 in fig. 5.
Finally, decision-level fusion is performed on the first fusion probability matrix and the second fusion probability matrix according to the probability decision to obtain a third fusion probability matrix, and the emotion category with the highest probability in this matrix is selected as the final recognition result. Referring to fig. 6, which compares the emotion recognition rates of the first fusion probability matrix, the second fusion probability matrix and the third fusion probability matrix, performing emotion recognition on the whole time sequences and on the sliding-window segment groups separately and then fusing the recognition results yields an emotion recognition accuracy of 99% or more.
By the method, the emotion recognition method combined with multiple modes is adopted, effective information of various modes in the video to be detected is fully utilized, the fusion efficiency is improved, and meanwhile the emotion recognition accuracy is improved.
Finally, the above is only a preferred embodiment of the present application and is not intended to limit the scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (8)

1. A multi-modal emotion recognition classification method is characterized by comprising the following steps:
s1, receiving data to be detected, wherein the data to be detected comprises a video containing a face and a corresponding video containing a body motion at the same time, and preprocessing the video containing the face and the corresponding video containing the body motion to obtain a face image time sequence containing the face and a body image time sequence containing the body motion;
s2, sequentially inputting the face image time sequence into a convolution neural network based on Alexnet and a circulation neural network based on BLSTM, taking out output data as a first face image space-time characteristic, sequentially inputting the body image time sequence into the convolution neural network based on Alexnet and the circulation neural network based on BLSTM, and taking out the output data as a first body image space-time characteristic;
s3, serially inputting the first face image space-time feature and the first body image space-time feature into a fully-connected neural network, inputting an output result into a support vector machine, obtaining probability matrixes belonging to different emotion types after the first face image space-time feature and the first body image space-time feature are fused, marking the probability matrixes as first probability matrixes, simultaneously serially inputting the first face image space-time feature and the first body image space-time feature into the support vector machine, obtaining probability matrixes belonging to different emotion types after the first face image space-time feature and the first body image space-time feature are serially connected, and marking the probability matrixes as second probability matrixes;
s4, inputting the first face image space-time feature into a support vector machine, obtaining probability matrixes of the first face image space-time feature belonging to different emotion types, marking the probability matrixes as third probability matrixes, inputting the first body image feature into the support vector machine, obtaining probability matrixes of the first body image space-time feature belonging to different emotion types, marking the probability matrixes as fourth probability matrixes, performing decision fusion on the first probability matrix, the second probability matrix, the third probability matrix and the fourth probability matrixes, obtaining first fusion probability matrixes, and taking the highest probability emotion type in the first fusion probability matrixes as an emotion recognition result;
in step S1, the preprocessing the video including the face and the corresponding video including the body motion specifically includes:
carrying out face detection and alignment processing on each frame of image in the video containing the face, and arranging the processed image frames according to a time sequence to obtain a face image time sequence;
normalizing each frame image in the video containing the body movement, and arranging the processed image frames according to a time sequence to obtain a body image time sequence;
wherein the step S1 further includes:
reading the mark of each image frame in the video containing the face, extracting the image frames marked as beginning, vertex and disappearance to form a face image time sequence;
reading the mark of each image frame in the video containing the body action, extracting the image frames marked as beginning, vertex and disappearance to form a body image time sequence;
wherein the markers of the image frame include a plateau, a start, a vertex, and a vanishing.
2. The method according to claim 1, wherein the step S1 is preceded by: and training the Alexnet-based convolutional neural network, the BLSTM-based cyclic neural network, the fully-connected neural network and the support vector machine.
3. The method according to claim 1, wherein the step S2 specifically includes:
s21, inputting the face image time sequence into a convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as face space initial features, carrying out principal component analysis on the face space initial features so as to realize space conversion and dimensionality reduction, obtaining first face image space features, inputting the body image time sequence into the convolution neural network based on Alexnet, taking out data of the first two full connection layers of the three full connection layers as body space initial features, carrying out principal component analysis on the body space initial features so as to realize space conversion and dimensionality reduction, and obtaining first body image space features;
s22, inputting the first human face image space characteristic into a BLSTM-based recurrent neural network, taking out the data of the first two full connected layers in the three full connected layers as the human face space-time initial characteristic, carrying out principal component analysis on the human face space-time initial characteristic to realize space conversion and dimensionality reduction, obtaining the first human face image space-time characteristic, inputting the first human body image space characteristic into the BLSTM-based recurrent neural network, taking out the data of the first two full connected layers in the three full connected layers as the human body space-time initial characteristic, carrying out principal component analysis on the human body space-time initial characteristic, realizing space conversion and dimensionality reduction, and obtaining the first human body image space-time characteristic.
4. The method according to any one of claims 1 to 3, wherein the step S1 further comprises:
and cutting the face image time sequence and the body image time sequence according to the preset length of the sliding window to obtain a face image time subsequence group consisting of a plurality of face image time sequence segments and a body image time subsequence group consisting of a plurality of body image time sequence segments.
5. The method according to claim 4, wherein the step S2 further comprises:
sequentially inputting the plurality of face image time sequence segments in the face image time-subsequence group into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second face image space-time features;
sequentially inputting the plurality of body image time sequence segments in the body image time-subsequence group into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second body image space-time features.
6. The method according to claim 5, wherein the step S2 further comprises:
inputting the plurality of face image time sequence segments in the face image time-subsequence group into the AlexNet-based convolutional neural network, taking the outputs of the first two of its three fully connected layers as the second initial face spatial features, and applying principal component analysis to the second initial face spatial features for space transformation and dimensionality reduction to obtain the second face image spatial features; inputting the plurality of body image time sequence segments in the body image time-subsequence group into the AlexNet-based convolutional neural network, taking the outputs of the first two of its three fully connected layers as the second initial body spatial features, and applying principal component analysis to the second initial body spatial features for space transformation and dimensionality reduction to obtain the second body image spatial features;
inputting the second face image spatial features into the BLSTM-based recurrent neural network, taking the outputs of the first two of its three fully connected layers as the second initial face space-time features, and applying principal component analysis to the second initial face space-time features for space transformation and dimensionality reduction to obtain the second face image space-time features; inputting the second body image spatial features into the BLSTM-based recurrent neural network, taking the outputs of the first two of its three fully connected layers as the second initial body space-time features, and applying principal component analysis to the second initial body space-time features for space transformation and dimensionality reduction to obtain the second body image space-time features.
7. The method according to claim 6, wherein the step S3 further comprises:
concatenating the second face image space-time features and the second body image space-time features and inputting the result into the fully connected neural network, then inputting the output into the support vector machine to obtain a probability matrix of the fused features belonging to different emotion types, recorded as the fifth probability matrix; and inputting the concatenation of the second face image space-time features and the second body image space-time features directly into the support vector machine to obtain a probability matrix of the concatenated features belonging to different emotion types, recorded as the sixth probability matrix.
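Note (illustrative only, not part of the claims): a sketch of the two fusion branches of claim 7, assuming PyTorch for the fully connected fusion network and scikit-learn's SVC for the probability outputs. The layer sizes, the six emotion classes and the random toy features and labels are assumptions, and the models are trained only on this toy data to keep the example self-contained.

    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.svm import SVC

    N_CLASSES = 6                            # assumed number of emotion types
    face_feat = torch.randn(60, 256)         # toy second face image space-time features
    body_feat = torch.randn(60, 256)         # toy second body image space-time features
    labels = np.arange(60) % N_CLASSES       # toy emotion labels, 10 per class

    fused_in = torch.cat([face_feat, body_feat], dim=1)   # concatenation of both modalities, (60, 512)

    # Feature-level fusion through a fully connected network, then an SVM head.
    fusion_net = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
    with torch.no_grad():
        fused_out = fusion_net(fused_in).numpy()

    svm_fused = SVC(probability=True).fit(fused_out, labels)
    svm_concat = SVC(probability=True).fit(fused_in.numpy(), labels)
    fifth_probs = svm_fused.predict_proba(fused_out)          # "fifth probability matrix", (60, 6)
    sixth_probs = svm_concat.predict_proba(fused_in.numpy())  # "sixth probability matrix", (60, 6)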
8. The method according to claim 7, wherein the step S4 further comprises:
inputting the first face image space-time features into the support vector machine to obtain a probability matrix of the first face image space-time features belonging to different emotion types, recorded as the seventh probability matrix; inputting the first body image space-time features into the support vector machine to obtain a probability matrix of the first body image space-time features belonging to different emotion types, recorded as the eighth probability matrix; and performing decision-level fusion on the fifth, sixth, seventh and eighth probability matrices to obtain a second fused probability matrix;
performing decision-level fusion on the first fused probability matrix and the second fused probability matrix to obtain a third fused probability matrix, and taking the emotion type with the highest probability in the third fused probability matrix as the emotion recognition result.
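Note (illustrative only, not part of the claims): the claims do not fix a particular decision-fusion rule; the sketch below assumes simple averaging of class-probability matrices followed by a per-sample argmax.

    import numpy as np

    def fuse(*probability_matrices: np.ndarray) -> np.ndarray:
        """Average several (n_samples, n_classes) probability matrices."""
        return np.mean(np.stack(probability_matrices, axis=0), axis=0)

    # Example: third fused matrix from the first and second fused matrices;
    # the recognised emotion is the per-sample argmax of the result.
    first_fused = np.array([[0.1, 0.7, 0.2]])
    second_fused = np.array([[0.2, 0.5, 0.3]])
    third_fused = fuse(first_fused, second_fused)             # [[0.15, 0.6, 0.25]]
    emotion_index = int(np.argmax(third_fused, axis=1)[0])    # -> 1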
CN201711144196.1A 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method Active CN107808146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144196.1A CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711144196.1A CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Publications (2)

Publication Number Publication Date
CN107808146A CN107808146A (en) 2018-03-16
CN107808146B true CN107808146B (en) 2020-05-05

Family

ID=61589748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711144196.1A Active CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Country Status (1)

Country Link
CN (1) CN107808146B (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491720B (en) * 2018-03-20 2023-07-14 腾讯科技(深圳)有限公司 Application identification method, system and related equipment
CN108491880B (en) * 2018-03-23 2021-09-03 西安电子科技大学 Object classification and pose estimation method based on neural network
CN108596039B (en) * 2018-03-29 2020-05-05 南京邮电大学 Bimodal emotion recognition method and system based on 3D convolutional neural network
CN109101999B (en) * 2018-07-16 2021-06-25 华东师范大学 Support vector machine-based cooperative neural network credible decision method
CN110795973A (en) * 2018-08-03 2020-02-14 北京大学 Multi-mode fusion action recognition method and device and computer readable storage medium
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109325457B (en) * 2018-09-30 2022-02-18 合肥工业大学 Emotion analysis method and system based on multi-channel data and recurrent neural network
CN109359599A (en) * 2018-10-19 2019-02-19 昆山杜克大学 Human facial expression recognition method based on combination learning identity and emotion information
CN109684911B (en) * 2018-10-30 2021-05-11 百度在线网络技术(北京)有限公司 Expression recognition method and device, electronic equipment and storage medium
CN109522945B (en) * 2018-10-31 2020-09-25 中国科学院深圳先进技术研究院 Group emotion recognition method and device, intelligent device and storage medium
CN109766759A (en) * 2018-12-12 2019-05-17 成都云天励飞技术有限公司 Emotion identification method and Related product
CN109783684B (en) * 2019-01-25 2021-07-06 科大讯飞股份有限公司 Video emotion recognition method, device and equipment and readable storage medium
CN110020596B (en) * 2019-02-21 2021-04-30 北京大学 Video content positioning method based on feature fusion and cascade learning
CN110037693A (en) * 2019-04-24 2019-07-23 中央民族大学 A kind of mood classification method based on facial expression and EEG
CN110378335B (en) * 2019-06-17 2021-11-19 杭州电子科技大学 Information analysis method and model based on neural network
CN110287912A (en) * 2019-06-28 2019-09-27 广东工业大学 Method, apparatus and medium are determined based on the target object affective state of deep learning
CN110234018B (en) * 2019-07-09 2022-05-31 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110472506B (en) * 2019-07-11 2023-05-26 广东工业大学 Gesture recognition method based on support vector machine and neural network optimization
CN110693508A (en) * 2019-09-02 2020-01-17 中国航天员科研训练中心 Multi-channel cooperative psychophysiological active sensing method and service robot
CN110765839B (en) * 2019-09-02 2022-02-22 合肥工业大学 Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image
CN110598608B (en) * 2019-09-02 2022-01-14 中国航天员科研训练中心 Non-contact and contact cooperative psychological and physiological state intelligent monitoring system
CN111242155A (en) * 2019-10-08 2020-06-05 台州学院 Bimodal emotion recognition method based on multimode deep learning
CN111476217A (en) * 2020-05-27 2020-07-31 上海乂学教育科技有限公司 Intelligent learning system and method based on emotion recognition
CN111914742A (en) * 2020-07-31 2020-11-10 辽宁工业大学 Attendance checking method, system, terminal equipment and medium based on multi-mode biological characteristics
CN112418034A (en) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN112784730B (en) * 2021-01-20 2022-03-29 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN116682168B (en) * 2023-08-04 2023-10-17 阳光学院 Multi-modal expression recognition method, medium and system
CN117351575B (en) * 2023-12-05 2024-02-27 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529504B (en) * 2016-12-02 2019-05-31 合肥工业大学 A kind of bimodal video feeling recognition methods of compound space-time characteristic
CN107273876B (en) * 2017-07-18 2019-09-10 山东大学 A kind of micro- expression automatic identifying method of ' the macro micro- transformation model of to ' based on deep learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968643A (en) * 2012-11-16 2013-03-13 华中科技大学 Multi-mode emotion recognition method based on Lie group theory

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Multi View Facial Action Unit Detection Based on CNN and BLSTM-RNN"; He Jun; 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition; 2017-06-29; pp. 848-853 *
"Bimodal Emotion Recognition from Facial Expression and Posture" (in Chinese); Yan Jingjie et al.; Journal of Image and Graphics; 2013-09-30; Vol. 18, No. 9; pp. 1101-1106 *

Also Published As

Publication number Publication date
CN107808146A (en) 2018-03-16

Similar Documents

Publication Publication Date Title
CN107808146B (en) Multi-mode emotion recognition and classification method
Marrero Fernandez et al. Feratt: Facial expression recognition with attention net
Zhang et al. Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn
Kim et al. Deep generative-contrastive networks for facial expression recognition
Tang et al. 3D facial expression recognition based on automatically selected features
CN106778796B (en) Human body action recognition method and system based on hybrid cooperative training
Yuan et al. Facial expression feature extraction using hybrid PCA and LBP
CN106203356B (en) A kind of face identification method based on convolutional network feature extraction
CN106548149A (en) The recognition methods of the micro- facial expression image sequence of face in monitor video sequence
Hossain et al. Multimodal feature learning for gait biometric based human identity recognition
CN115862120B (en) Face action unit identification method and equipment capable of decoupling separable variation from encoder
CN103577804B (en) Based on SIFT stream and crowd's Deviant Behavior recognition methods of hidden conditional random fields
Vadlapati et al. Facial recognition using the OpenCV Libraries of Python for the pictures of human faces wearing face masks during the COVID-19 pandemic
Bai et al. Collaborative attention mechanism for multi-view action recognition
Nasir et al. ENGA: elastic net-based genetic algorithm for human action recognition
CN113111797B (en) Cross-view gait recognition method combining self-encoder and view transformation model
Kong et al. A hierarchical model for human interaction recognition
Seyedarabi et al. Automatic lip tracking and action units classification using two-step active contours and probabilistic neural networks
CN113076905A (en) Emotion recognition method based on context interaction relationship
CN112508121A (en) Method and system for sensing outside by industrial robot
Carvajal et al. Multi-action recognition via stochastic modelling of optical flow and gradients
CN114241573A (en) Facial micro-expression recognition method and device, electronic equipment and storage medium
Verma et al. Facial expression recognition: A review
CN111553202A (en) Training method, detection method and device of neural network for detecting living body
Ptucha et al. Fusion of static and temporal predictors for unconstrained facial expression recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant