CN107808146A - A multi-modal emotion recognition and classification method - Google Patents

A multi-modal emotion recognition and classification method

Info

Publication number: CN107808146A
Authority: CN (China)
Prior art keywords: space, time, probability matrix, facial image, image space
Application number: CN201711144196.1A
Other languages: Chinese (zh)
Other versions: CN107808146B
Inventors: 孙波 (Sun Bo), 何珺 (He Jun), 余乐军 (Yu Lejun), 曹斯铭 (Cao Siming)
Original and current assignee: Beijing Normal University
Application CN201711144196.1A filed by Beijing Normal University; published as CN107808146A and, after grant, as CN107808146B
Legal status: Active (granted)


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; face representation
    • G06V40/172 - Classification, e.g. identification
    • G06V40/174 - Facial expression recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a multi-modal emotion recognition and classification method. The method processes a video containing the face to be detected and a temporally corresponding video containing body actions, converts each video into a time series of image frames, and extracts the temporal and spatial features of those image sequences. Based on the multi-layer deep spatio-temporal features thus obtained, the method performs feature-level fusion of the various features and decision-level fusion of the classification results, so that the emotion type of the video under test is identified from multiple modalities. The method makes full use of the effective information present in each modality and improves the recognition rate of emotion recognition.

Description

A multi-modal emotion recognition and classification method
Technical field
The present invention relates to the field of computer processing technology, and more particularly to a multi-modal emotion recognition and classification method.
Background technology
Emotion recognition is an emerging research field at the intersection of computer science, cognitive science, psychology, brain science and neuroscience. Its purpose is to teach computers to understand human emotional expression, so that they can ultimately recognize and understand emotion as humans do. As a challenging interdisciplinary subject, emotion recognition has therefore become a research hotspot in pattern recognition, computer vision, big-data mining and artificial intelligence, both at home and abroad, with important research value and application prospects.
Existing research on emotion recognition shows two clear trends. On the one hand, the data have expanded from still images to dynamic image sequences; on the other hand, recognition has moved from a single modality to multiple modalities. Emotion recognition from still images has already produced a solid body of results, but methods based on static pictures ignore the temporal dynamics of human expression, and, compared with image-based recognition, the accuracy of video-based analysis still requires further study. Moreover, psychological studies show that emotion recognition is essentially a multi-modal problem: judging affective state jointly from body posture and facial expression works better than using single-modality information, and fusing multi-modal information yields more accurate and reliable recognition. Multi-modal information fusion has therefore also become a research hotspot in the field of emotion recognition.
In the prior art, methods for fusing the facial-expression and body-posture modalities use only a single fusion mode, choosing either feature-level fusion or decision-level fusion according to some strategy. On the one hand, the prior art cannot extract effective spatio-temporal features from video data for emotion recognition; on the other hand, whether early (feature-level) or late (decision-level) fusion is used, such fusion methods are model-independent, do not make full use of the effective information present in each modality, and generally suffer from low fusion efficiency.
Content of the invention
To solve the problems in the prior art that effective spatio-temporal features cannot be extracted from video data for emotion recognition, and that, whether early or late fusion is used, existing fusion methods are model-independent, fail to make full use of the effective information present in each modality and generally have low fusion efficiency, a multi-modal emotion recognition and classification method is provided.
According to one aspect of the present invention, a multi-modal emotion recognition and classification method includes:
S1: receiving the data to be tested, the data comprising a video containing a face and a temporally corresponding video containing body actions, and pre-processing the video containing the face and the corresponding video containing body actions to obtain a facial image time series containing the face and a body image time series containing the body actions;
S2: inputting the facial image time series successively into a convolutional neural network based on AlexNet and a recurrent neural network based on BLSTM (bidirectional LSTM) and taking the output data as the first facial image spatio-temporal feature; inputting the body image time series successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network and taking the output data as the first body image spatio-temporal feature;
S3: concatenating the first facial image spatio-temporal feature and the first body image spatio-temporal feature and inputting them into a fully connected neural network, obtaining the probability matrix, over the different emotion types, of the fused first facial and first body image spatio-temporal features, and labeling it the first probability matrix; at the same time, inputting the concatenated first facial image spatio-temporal feature and first body image spatio-temporal feature into a support vector machine, obtaining the probability matrix, over the different emotion types, of the concatenated features, and labeling it the second probability matrix;
S4: inputting the first facial image spatio-temporal feature into a support vector machine, obtaining the probability matrix of this feature over the different emotion types, and labeling it the third probability matrix; inputting the first body image spatio-temporal feature into a support vector machine, obtaining its probability matrix over the different emotion types, and labeling it the fourth probability matrix; performing decision fusion on the first, second, third and fourth probability matrices to obtain the first fused probability matrix; and taking the emotion type with the highest probability in the first fused probability matrix as the emotion recognition result.
Wherein, before step S1 the method further includes: training the AlexNet-based convolutional neural network, the BLSTM-based recurrent neural network, the fully connected neural network and the support vector machines.
Wherein, the pre-processing in step S1 of the video containing the face and the corresponding video containing body actions specifically includes:
performing face detection and alignment on each frame of the video containing the face, and arranging the processed frames in temporal order to obtain the facial image time series;
normalizing each frame of the video containing body actions, and arranging the processed frames in temporal order to obtain the body image time series.
Wherein, step S1 further comprises:
reading the label of each frame in the video containing the face, extracting the frames labeled onset, apex and offset, and forming the facial image time series;
reading the label of each frame in the video containing body actions, extracting the frames labeled onset, apex and offset, and forming the body image time series;
wherein the frame labels include neutral, onset, apex and offset.
Wherein, step S2 specifically includes:
S21: inputting the facial image time series into the AlexNet-based convolutional neural network, taking the data of the first two of its three fully connected layers as the initial facial spatial features, and applying principal component analysis to these features to achieve space conversion and dimensionality reduction, obtaining the first facial image spatial features; inputting the body image time series into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the initial body spatial features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first body image spatial features;
S22: inputting the first facial image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as the initial facial spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first facial image spatio-temporal feature; inputting the first body image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the initial body spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first body image spatio-temporal feature.
Wherein, step S1 also includes:
cutting the facial image time series and the body image time series with a sliding window of preset length, obtaining a facial image time-subsequence group composed of multiple facial image time-series fragments and a body image time-subsequence group composed of multiple body image time-series fragments.
Wherein, step S2 further comprises:
inputting the multiple facial image time-series fragments of the facial image time-subsequence group successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second facial image spatio-temporal feature;
inputting the multiple body image time-series fragments of the body image time-subsequence group successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second body image spatio-temporal feature.
Wherein, step S2 also includes:
inputting the multiple facial image time-series fragments of the facial image time-subsequence group into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the second initial facial spatial features, and applying principal component analysis to these features to achieve space conversion and dimensionality reduction, obtaining the second facial image spatial features; inputting the multiple body image time-series fragments of the body image time-subsequence group into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the second initial body spatial features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second body image spatial features;
inputting the second facial image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the second initial facial spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second facial image spatio-temporal feature; inputting the second body image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the second initial body spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second body image spatio-temporal feature.
Wherein, step S3 further comprises:
concatenating the second facial image spatio-temporal feature and the second body image spatio-temporal feature and inputting them into the fully connected neural network, inputting the output into the support vector machine, obtaining the probability matrix, over the different emotion types, of the fused second facial and second body image spatio-temporal features, and labeling it the fifth probability matrix; at the same time, inputting the concatenated second facial image spatio-temporal feature and second body image spatio-temporal feature into the support vector machine, obtaining the probability matrix, over the different emotion types, of the concatenated features, and labeling it the sixth probability matrix.
Wherein, step S4 further comprises:
inputting the second facial image spatio-temporal feature into the support vector machine, obtaining the probability matrix of this feature over the different emotion types, and labeling it the seventh probability matrix; inputting the second body image spatio-temporal feature into the support vector machine, obtaining its probability matrix over the different emotion types, and labeling it the eighth probability matrix; performing decision fusion on the fifth, sixth, seventh and eighth probability matrices to obtain the second fused probability matrix;
performing decision fusion on the first fused probability matrix and the second fused probability matrix to obtain the third fused probability matrix, and taking the emotion type with the highest probability in the third fused probability matrix as the emotion recognition result.
The method provided by the invention adopts multi-modal joint emotion recognition, makes full use of the effective information of every modality in the video under test, improves fusion efficiency and, at the same time, improves the accuracy of emotion recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a multi-modal emotion recognition and classification method provided by an embodiment of the invention;
Fig. 2 compares the emotion recognition rates of different fusion strategies based on the full time series in a multi-modal emotion recognition and classification method provided by an embodiment of the invention;
Fig. 3 is a schematic diagram of the neural network structure for spatio-temporal feature extraction in a multi-modal emotion recognition and classification method provided by an embodiment of the invention;
Fig. 4 is a schematic diagram of cutting a time series with a sliding window in a multi-modal emotion recognition and classification method provided by an embodiment of the invention;
Fig. 5 compares the emotion recognition rates of different fusion strategies based on time-series fragments in a multi-modal emotion recognition and classification method provided by an embodiment of the invention;
Fig. 6 compares the emotion recognition rates obtained by fusing the full time series and the time-series fragments in a multi-modal emotion recognition and classification method provided by an embodiment of the invention.
Embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the invention, not to limit its scope.
With reference to Fig. 1, which is a flowchart of a multi-modal emotion recognition and classification method provided by an embodiment of the invention, the method includes:
S1: receiving the data to be tested, the data comprising a video containing a face and a temporally corresponding video containing body actions, and pre-processing the two videos to obtain a facial image time series and a body image time series.
Specifically, a video containing a person's facial expression and a video containing body actions over the same time span are received. After pre-processing, the face video and the body-action video are each arranged by image frame, yielding a facial image time series and a body image time series composed of the frames of the videos.
In this way, the video data are converted into sequences of image frames, which makes the data easier to operate on and facilitates subsequent processing.
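A minimal sketch of this conversion, assuming OpenCV for video decoding; the file names and frame size are illustrative, not taken from the patent:

```python
import cv2

def video_to_frames(path, size=(224, 224)):
    """Decode a video into an ordered list of resized RGB frames."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(cv2.resize(frame, size))
    cap.release()
    return frames  # temporal order of the video is preserved

face_seq = video_to_frames("face.avi")   # facial image time series
body_seq = video_to_frames("body.avi")   # body image time series
```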
S2: inputting the facial image time series successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network and taking the output data as the first facial image spatio-temporal feature; inputting the body image time series successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network and taking the output data as the first body image spatio-temporal feature.
Specifically, the facial image time series and the body image time series obtained in S1 are input into the trained AlexNet-based convolutional neural network and the trained BLSTM-based recurrent neural network. The AlexNet-based convolutional network extracts the spatial features of the image sequence from each time series, and the recurrent network then extracts, from those spatial features, the spatio-temporal features of the image sequence. In this embodiment, inputting the facial image time series and the body image time series respectively into the trained networks yields the spatio-temporal feature of the facial image sequence, i.e. the first facial image spatio-temporal feature, and the spatio-temporal feature of the body image sequence, i.e. the first body image spatio-temporal feature.
In this way, a deep network combining an AlexNet-based convolutional neural network with a BLSTM-based recurrent neural network is built to extract local and global spatio-temporal features, so that the facial image time series and the body image time series can be classified according to the multi-layer deep spatio-temporal features obtained.
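The patent names no deep-learning framework; the PyTorch sketch below is one illustrative way to chain an AlexNet-style convolutional stack, the three fully connected layers of 1024/512/10 dimensions described later, and a bidirectional LSTM over the frame axis. The BLSTM hidden size, the input resolution, and the direct 512-dimensional hand-off (in place of the 1536-dimensional concatenation plus PCA detailed further below) are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class SpatioTemporalExtractor(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        base = alexnet(weights=None)
        self.conv = nn.Sequential(base.features, base.avgpool)  # AlexNet conv stack
        self.fc = nn.Sequential(               # the three fully connected layers
            nn.Flatten(),
            nn.Linear(256 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, num_classes),       # 10-d layer, unused for features
        )
        self.blstm = nn.LSTM(input_size=512, hidden_size=256,
                             bidirectional=True, batch_first=True)

    def forward(self, seq):                    # seq: (batch, time, 3, 224, 224)
        b, t = seq.shape[:2]
        x = self.conv(seq.flatten(0, 1))       # per-frame convolutional features
        spatial = self.fc[:4](x)               # stop after the 512-d FC layer
        out, _ = self.blstm(spatial.view(b, t, -1))
        return out                             # (batch, time, 512) spatio-temporal

features = SpatioTemporalExtractor()(torch.randn(2, 15, 3, 224, 224))
```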
S3: concatenating the first facial image spatio-temporal feature and the first body image spatio-temporal feature and inputting them into the fully connected neural network, inputting the output into the support vector machine, obtaining the probability matrix, over the different emotion types, of the fused features, and labeling it the first probability matrix; at the same time, inputting the concatenated first facial image spatio-temporal feature and first body image spatio-temporal feature into the support vector machine, obtaining the probability matrix, over the different emotion types, of the concatenated features, and labeling it the second probability matrix.
Specifically, the first facial image spatio-temporal feature and the first body image spatio-temporal feature are concatenated and input into the trained fully connected neural network, and its output is input into the trained support vector machine. From the combination of the two modalities, the probability that the combined feature belongs to each emotion category is obtained, and the first classification probability matrix is built.
Within the output of the fully connected neural network, it is preferable to take the data of the penultimate fully connected layer, reduce its dimensionality by principal component analysis, and then input the processed data into the trained support vector machine, so as to obtain a more precise probabilistic classification result.
On the other hand, the concatenated first facial image spatio-temporal feature and first body image spatio-temporal feature are input directly into the trained support vector machine, which yields the probability that the combined feature belongs to each emotion category; from this the second classification probability matrix is built.
In the concatenation of the first facial image spatio-temporal feature and the first body image spatio-temporal feature, the concatenated feature may first be reduced by principal component analysis and then input into the trained support vector machine to obtain the probability output. In this way, feature-level fusion combines the facial features with the body-action features under two different fusion strategies, a neural-network fusion strategy and a feature-concatenation fusion strategy, each of which yields a probability matrix of the video data over the emotion categories.
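As an illustration of the feature-concatenation path, the sketch below concatenates the two modality features, reduces them with PCA and obtains a class-probability matrix from an SVM. scikit-learn's SVC, the PCA dimension, and fitting and scoring on the same features (for brevity) are assumptions; the neural-network path would instead pass the concatenation through the trained fully connected network and feed its penultimate layer to the SVM:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def concat_fusion_probabilities(face_feat, body_feat, labels, dim=128):
    """Concatenation (serial) fusion of two modality features -> probability matrix."""
    fused = np.hstack([face_feat, body_feat])       # feature-level concatenation
    fused = PCA(n_components=dim).fit_transform(fused)  # space conversion + reduction
    svm = SVC(probability=True).fit(fused, labels)  # trained SVM classifier
    return svm.predict_proba(fused)                 # e.g. the second probability matrix
```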
S4: inputting the first facial image spatio-temporal feature into the support vector machine, obtaining the probability matrix of this feature over the different emotion types, and labeling it the third probability matrix; inputting the first body image spatio-temporal feature into the support vector machine, obtaining its probability matrix over the different emotion types, and labeling it the fourth probability matrix; performing decision fusion on the first, second, third and fourth probability matrices to obtain the first fused probability matrix; and taking the emotion type with the highest probability in the first fused probability matrix as the emotion recognition result.
Specifically, the first facial image spatio-temporal feature alone is input into the trained support vector machine, which yields the probability matrix of this feature over the emotion categories; from it the third probability matrix is built. On the other hand, the first body image spatio-temporal feature alone is input into the trained support vector machine, which yields its probability matrix over the emotion categories; from it the fourth probability matrix is built.
With reference to Fig. 2, which compares the emotion recognition rates of different fusion strategies based on the full time series, the four probability matrices obtained above are decision-fused into a new fused probability matrix. This matrix contains the set of probabilities that the data under test belong to each emotion category, and the category with the highest probability in this set is selected as the final recognition result.
In this way, a person's facial expression and the body actions of the same period are combined: deep neural networks extract the spatio-temporal features of the data under test, support vector machines classify those features under different fusion strategies, and multi-modal emotion recognition is finally achieved, making full use of the effective information in each modality and raising the accuracy of emotion recognition.
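The patent does not fix the decision-fusion rule; the sketch below assumes an unweighted mean over the classifier probability matrices, followed by an argmax over the emotion types:

```python
import numpy as np

def decision_fusion(prob_matrices):
    """Average the per-classifier probability matrices and pick the argmax."""
    fused = np.mean(prob_matrices, axis=0)   # e.g. the first fused probability matrix
    return fused, fused.argmax(axis=1)       # fused matrix, predicted emotion indices

# toy usage: four classifiers, 3 samples, 10 emotion classes
mats = [np.random.dirichlet(np.ones(10), size=3) for _ in range(4)]
fused, pred = decision_fusion(mats)
```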
On the basis of the above embodiment, before step S1 the method also includes: training the AlexNet-based convolutional neural network, the BLSTM-based recurrent neural network, the fully connected neural network and the support vector machines.
Specifically, 127 videos from the FABO database are used to train the AlexNet-based convolutional neural network, the BLSTM-based recurrent neural network, the fully connected neural network and the support vector machines.
The AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network are trained on image sequences in which the face and body change, and the network parameters are tuned to obtain the feature-extraction models. The spatio-temporal features of the different facial activities and the spatio-temporal features of the body postures are then input into the support vector machines to obtain the emotion classification models.
On the basis of the above embodiment, the pre-processing in step S1 of the video containing the face and the corresponding video containing body actions specifically includes: performing face detection and alignment on each frame of the video containing the face, and arranging the processed frames in temporal order to obtain the facial image time series; normalizing each frame of the video containing body actions, and arranging the processed frames in temporal order to obtain the body image time series.
Specifically, each frame of the video containing the face undergoes face detection and alignment, and the processed frames are arranged in temporal order to obtain the facial image time series. At the same time, the frames of the video containing body actions are normalized so that every frame has a consistent format, and the processed frames are arranged in temporal order to form the body image time series.
In this way, every frame of the facial image time series and of the body image time series has the same format, which facilitates subsequent operations such as feature extraction.
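A rough sketch of the per-frame pre-processing, with OpenCV's Haar cascade as a stand-in detector; the patent names no specific face detector, and the alignment step (e.g. by eye landmarks) is omitted here:

```python
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_face(frame, size=(224, 224)):
    """Detect the face in a frame and crop/resize it; returns None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, 1.1, 5)
    if len(boxes) == 0:
        return None
    x, y, w, h = boxes[0]                     # crop the detected face region
    return cv2.resize(frame[y:y + h, x:x + w], size)

def preprocess_body(frame, size=(224, 224)):
    return cv2.resize(frame, size)            # size normalization only
```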
On the basis of the above embodiment, step S1 further comprises: reading the label of each frame in the video containing the face, extracting the frames labeled onset, apex and offset, and forming the facial image time series; reading the label of each frame in the video containing body actions, extracting the frames labeled onset, apex and offset, and forming the body image time series. The frame labels include neutral, onset, apex and offset.
Specifically, every frame of the videos in the test database is labeled: all frames of the phase in which an expression or action begins are labeled onset; the period in which the expression or action reaches its maximum is labeled apex; all frames of the period in which the expression is released are labeled offset; and the remaining frames, in which no expression is shown, are labeled neutral.
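A minimal sketch of the frame screening evaluated in the following paragraphs, assuming `frames` and `labels` are parallel lists with labels drawn from the four marks just named:

```python
KEEP = {"onset", "apex", "offset"}

def screen(frames, labels):
    """Keep only the onset, apex and offset frames, in their original order."""
    return [f for f, lab in zip(frames, labels) if lab in KEEP]
```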
When the facial image time series and the body image time series are used for emotion recognition, the time series may be composed of all frames of the video, or only of the frames of the period in which the expression or action reaches its maximum. Preferably, the frames before the expression starts and after the expression has finished are discarded, and only the frames between the start and the end of the expression are classified: the frames labeled onset, apex and offset are extracted to compose the time series, which raises the overall recognition precision. Table 1 shows the results of emotion recognition from the face videos under the different frame-extraction methods, and Table 2 shows the corresponding results for the body actions.
Table 1 (face videos)

Time-series screening method     MAA (%)   ACC (%)
Apex sequence                    55.90     56.84
Onset-apex-offset sequence       57.56     61.11
All frames of the whole cycle    51.67     53.85

Table 2 (body actions)

Time-series screening method     MAA (%)   ACC (%)
Apex sequence                    45.88     50.60
Onset-apex-offset sequence       48.98     51.70
All frames of the whole cycle    44.50     49.77
It can be seen from Tables 1 and 2 that composing the time series from the frames labeled onset, apex and offset gives a higher recognition rate than the other schemes. Here MAA denotes the macro average accuracy and ACC the overall accuracy (the proportion of all samples classified correctly), computed as:

P_i = TP_i / (TP_i + FP_i)

MAA = (1/s) · Σ_{i=1}^{s} P_i

where s is the number of emotion categories, P_i is the precision of the i-th emotion class, TP_i is the number of samples correctly classified into the i-th class, and FP_i is the number of samples wrongly classified into the i-th class.
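A short sketch of these two metrics, computed directly from ground-truth and predicted label vectors (NumPy is an assumed convenience, not named by the patent):

```python
import numpy as np

def maa_acc(y_true, y_pred, s):
    """Macro average accuracy (MAA) and overall accuracy (ACC)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions = []
    for i in range(s):
        predicted_i = (y_pred == i)                 # TP_i + FP_i
        tp = np.sum(predicted_i & (y_true == i))    # TP_i
        precisions.append(tp / predicted_i.sum() if predicted_i.any() else 0.0)
    maa = np.mean(precisions)                       # (1/s) * sum of P_i
    acc = np.mean(y_true == y_pred)                 # overall accuracy
    return maa, acc
```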
On the basis of the above embodiment, step S2 specifically includes:
S21: inputting the facial image time series into the AlexNet-based convolutional neural network, taking the data of the first two of its three fully connected layers as the initial facial spatial features, and applying principal component analysis to these features to achieve space conversion and dimensionality reduction, obtaining the first facial image spatial features; inputting the body image time series into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the initial body spatial features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first body image spatial features;
S22: inputting the first facial image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as the initial facial spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first facial image spatio-temporal feature; inputting the first body image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the initial body spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the first body image spatio-temporal feature.
Specifically, with reference to Fig. 3, obtaining multi-layer deep spatio-temporal features from the facial and body image time series requires a convolutional neural network for the feature extraction in image space and a recurrent neural network for the further extraction of the temporal information of the image sequences. In this embodiment, an AlexNet-based convolutional neural network extracts the spatial features of the facial image time series and of the body image time series. Preferably, the last three layers of the AlexNet-based network are fully connected layers whose output feature dimensions are 1024, 512 and 10; the output data of the first two of these three fully connected layers are taken as the initial spatial features, so the initial feature dimension totals 1536. Principal component analysis is applied to this 1536-dimensional feature to achieve space conversion and dimensionality reduction, so that the dimensionality meets the input requirement of the BLSTM-based recurrent neural network. The outputs of the first two of the last three fully connected layers of the recurrent network are then extracted as the initial spatio-temporal feature, which is likewise 1536-dimensional, and principal component analysis is applied again to achieve space conversion and dimensionality reduction, finally yielding the spatio-temporal feature. In this step, the facial image time series is input successively into the trained AlexNet-based convolutional network and the trained BLSTM-based recurrent network to obtain the facial image spatio-temporal feature, and the body image time series is processed in the same way to obtain the body image spatio-temporal feature; these are labeled the first facial image spatio-temporal feature and the first body image spatio-temporal feature.
In this way, both the spatial features and the temporal features of the image time series are extracted.
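A small sketch of the 1536-dimensional hand-off described above: the outputs of the first two fully connected layers are concatenated per frame and reduced by PCA. The arrays here are random stand-ins and the target dimension is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

# fc1 (1024-d) and fc2 (512-d) stand for the per-frame outputs of the first two
# fully connected layers; here they are random placeholders for 100 frames.
fc1, fc2 = np.random.randn(100, 1024), np.random.randn(100, 512)
initial = np.hstack([fc1, fc2])                        # 1536-d initial features
reduced = PCA(n_components=64).fit_transform(initial)  # space conversion + reduction
```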
On the basis of the above embodiments, step S1 also includes: cutting the facial image time series and the body image time series with a sliding window of preset length, obtaining a facial image time-subsequence group composed of multiple facial image time-series fragments and a body image time-subsequence group composed of multiple body image time-series fragments.
Specifically, after the facial image time series and the body image time series are obtained, they are cut with a sliding window of preset length. As shown in Fig. 4, a facial image time series of length 15 containing 5 frames labeled onset, 5 labeled apex and 5 labeled offset, cut with a sliding window of length 6 and stride 1, yields 10 facial image time-series fragments of length 6, which form the facial image time-subsequence group. The window length should be chosen, as far as possible, so that each fragment contains frames of at least two of the three types onset, apex and offset. The body image time series is cut in the same way, and the resulting body image time-series fragments form the body image time-subsequence group.
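A minimal sketch of the sliding-window cutting with the example values from the text (window length 6, stride 1, a 15-frame sequence yielding 10 fragments):

```python
def sliding_window(seq, length=6, step=1):
    """Cut a sequence into overlapping fixed-length fragments."""
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]

fragments = sliding_window(list(range(15)))   # 10 fragments of 6 frames each
```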
Table 3 shows the emotion recognition results for different sliding-window lengths based on the facial image time series, and Table 4 shows the corresponding results based on the body image time series.
Table 3 (face videos)

Window length t   6       7       8       9       10
MAA (%)           58.61   60.45   67.09   58.48   56.13
ACC (%)           59.00   61.25   66.46   59.03   57.21

Table 4 (body actions)

Window length t   6       7       8       9       10
MAA (%)           43.66   55.00   50.20   47.33   45.81
ACC (%)           44.85   55.98   51.83   48.76   46.00
It can be seen from Tables 3 and 4 that, with a suitable sliding-window length, the recognition accuracy is higher than that of the whole-time-series approach of Tables 1 and 2, which uses no sliding-window cutting.
On the basis of the above embodiments, step S2 further comprises: inputting the multiple facial image time-series fragments of the facial image time-subsequence group successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network to obtain the second facial image spatio-temporal feature; inputting the multiple body image time-series fragments of the body image time-subsequence group successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network to obtain the second body image spatio-temporal feature.
Specifically, the multiple facial image time-series fragments of the facial image time-subsequence group and the multiple body image time-series fragments of the body image time-subsequence group are likewise fed into the trained AlexNet-based convolutional neural network and the trained BLSTM-based recurrent neural network, yielding respectively the spatio-temporal features of all time-series fragments in the facial image time-subsequence group and of all time-series fragments in the body-action image time-subsequence group; these are labeled the second facial image spatio-temporal feature and the second body-action image spatio-temporal feature.
In this way, feature extraction over the cut time-series fragments yields new facial image spatio-temporal features and new body-action image spatio-temporal features for the classifiers.
On the basis of the above embodiments, step S2 also includes:
inputting the multiple facial image time-series fragments of the facial image time-subsequence group into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the second initial facial spatial features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second facial image spatial features; inputting the multiple body image time-series fragments of the body image time-subsequence group into the AlexNet-based convolutional neural network, taking the data of the first two of the three fully connected layers as the second initial body spatial features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second body image spatial features;
inputting the second facial image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the second initial facial spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second facial image spatio-temporal feature; inputting the second body image spatial features into the BLSTM-based recurrent neural network, taking the data of the first two of the three fully connected layers as the second initial body spatio-temporal features, and applying principal component analysis to achieve space conversion and dimensionality reduction, obtaining the second body image spatio-temporal feature.
Specifically, this follows the same method as the extraction of the first facial and first body spatio-temporal features in the above embodiment: the convolutional neural network performs the feature extraction in image space, and the recurrent neural network further extracts the temporal information of the image sequences. In this embodiment, the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network extract the spatio-temporal features of all the time-series fragments produced by the sliding-window cutting, yielding the second facial spatio-temporal feature and the second body spatio-temporal feature. The feature extraction inside the networks is the same as in the previous embodiments and is not repeated here.
On the basis of the above embodiments, step S3 further comprises: concatenating the second facial image spatio-temporal feature and the second body image spatio-temporal feature and inputting them into the fully connected neural network, inputting the output into the support vector machine, obtaining the probability matrix, over the different emotion types, of the fused second facial and second body image spatio-temporal features, and labeling it the fifth probability matrix; at the same time, inputting the concatenated second facial image spatio-temporal feature and second body image spatio-temporal feature into the support vector machine, obtaining the probability matrix, over the different emotion types, of the concatenated features, and labeling it the sixth probability matrix.
Specifically, the second facial image spatio-temporal feature and the second body image spatio-temporal feature are concatenated and input into the trained fully connected neural network; the data of its penultimate fully connected layer are taken as output, reduced by principal component analysis, and input into the trained support vector machine. From the combination of the two modalities, the probability that the combined second facial and second body image spatio-temporal feature belongs to each emotion category is obtained, and the fifth classification probability matrix is built.
On the other hand, the concatenated second facial image spatio-temporal feature and second body image spatio-temporal feature are input into the trained support vector machine, which yields the probability that the concatenated feature belongs to each emotion category; from these probabilities the sixth classification probability matrix is built.
On the basis of the above embodiment, step S4 further comprises: inputting the second facial image spatio-temporal feature into the support vector machine, obtaining the probability matrix of this feature over the different emotion types, and labeling it the seventh probability matrix; inputting the second body image spatio-temporal feature into the support vector machine, obtaining its probability matrix over the different emotion types, and labeling it the eighth probability matrix; performing decision fusion on the fifth, sixth, seventh and eighth probability matrices to obtain the second fused probability matrix; performing decision fusion on the first fused probability matrix and the second fused probability matrix to obtain the third fused probability matrix; and taking the emotion type with the highest probability in the third fused probability matrix as the emotion recognition result.
Specifically, the second facial image spatio-temporal feature alone is input into the trained support vector machine, yielding the probability matrix of this feature over the emotion categories, which is labeled the seventh probability matrix. On the other hand, the second body image spatio-temporal feature alone is input into the trained support vector machine, yielding its probability matrix over the emotion categories, which is labeled the eighth probability matrix.
With reference to Fig. 5, which compares the recognition rates of emotion recognition based on the fifth, sixth, seventh and eighth probability matrices, decision-fusing these four matrices generates the second fused probability matrix, which reaches the emotion-type recognition rate shown as Multi4-2 in Fig. 5.
Finally, the first fused probability matrix and the second fused probability matrix undergo decision-level fusion according to their probabilities, producing the third fused probability matrix, and the emotion category with the highest probability in this matrix is selected as the final recognition result. With reference to Fig. 6, which shows the recognition rate of the first fused probability matrix, the recognition rate of the second fused probability matrix, and the result of re-fusing the recognition over the whole time series with the recognition over the sliding-window time-series fragment groups, the third fusion achieves an emotion recognition accuracy of more than 99%.
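A sketch of the full three-level decision fusion, again assuming an unweighted mean as the fusion rule and using random probability matrices as stand-ins for the eight matrices described above:

```python
import numpy as np

def fuse(mats):
    """Unweighted decision fusion of probability matrices (assumed rule)."""
    return np.mean(mats, axis=0)

# p[0:4] from the full sequences, p[4:8] from the sliding-window fragments
p = [np.random.dirichlet(np.ones(10), size=5) for _ in range(8)]
fused1 = fuse(p[:4])             # first fused probability matrix (step S4)
fused2 = fuse(p[4:])             # second fused probability matrix
fused3 = fuse([fused1, fused2])  # third fused probability matrix
emotion = fused3.argmax(axis=1)  # final emotion recognition result
```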
In this way, the multi-modal joint emotion recognition method makes full use of the effective information of every modality in the video under test, improves the fusion efficiency and, at the same time, improves the accuracy of emotion recognition.
Finally, the above is only a preferred embodiment of the present application and is not intended to limit the scope of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A multi-modal emotion recognition and classification method, characterized by comprising:
    S1: receiving the data to be tested, the data comprising a video containing a face and a temporally corresponding video containing body actions, and pre-processing the two videos to obtain a facial image time series containing the face and a body image time series containing the body actions;
    S2: inputting the facial image time series successively into a convolutional neural network based on AlexNet and a recurrent neural network based on BLSTM and taking the output data as the first facial image spatio-temporal feature; inputting the body image time series successively into the AlexNet-based convolutional neural network and the BLSTM-based recurrent neural network and taking the output data as the first body image spatio-temporal feature;
    S3: concatenating the first facial image spatio-temporal feature and the first body image spatio-temporal feature and inputting them into a fully connected neural network, inputting the output into a support vector machine, obtaining the probability matrix, over the different emotion types, of the fused features, and labeling it the first probability matrix; at the same time, inputting the concatenated first facial image spatio-temporal feature and first body image spatio-temporal feature into a support vector machine, obtaining the probability matrix, over the different emotion types, of the concatenated features, and labeling it the second probability matrix;
    S4: inputting the first facial image spatio-temporal feature into a support vector machine, obtaining the probability matrix of this feature over the different emotion types, and labeling it the third probability matrix; inputting the first body image spatio-temporal feature into a support vector machine, obtaining its probability matrix over the different emotion types, and labeling it the fourth probability matrix; performing decision fusion on the first, second, third and fourth probability matrices to obtain the first fused probability matrix; and taking the emotion type with the highest probability in the first fused probability matrix as the emotion recognition result.
  2. The method according to claim 1, characterized in that, before step S1, the method further comprises: training the AlexNet-based convolutional neural network, the BLSTM-based recurrent neural network, the fully connected neural network and the support vector machines.
  3. The method according to claim 1, characterized in that the pre-processing in step S1 of the video containing the face and the corresponding video containing body actions specifically comprises:
    performing face detection and alignment on each frame of the video containing the face, and arranging the processed frames in temporal order to obtain the facial image time series;
    normalizing each frame of the video containing body actions, and arranging the processed frames in temporal order to obtain the body image time series.
  4. The method according to claim 3, characterized in that step S1 further comprises:
    reading the label of each frame in the video containing the face, extracting the frames labeled onset, apex and offset, and composing the facial image time series;
    reading the label of each frame in the video containing body actions, extracting the frames labeled onset, apex and offset, and composing the body image time series;
    wherein the frame labels comprise neutral, onset, apex and offset.
  5. 5. according to the method for claim 1, it is characterised in that the step S2 is specifically included:
    S21, the facial image time series is input in the convolutional neural networks based on Alexnet, takes out three and connect entirely The data of the full articulamentum of the first two in layer are connect as face space initial characteristicses, by face space initial characteristicses carry out it is main into Analysis, so as to realize space conversion and dimensionality reduction, the first facial image space characteristics are obtained, by the body image time series It is input in the convolutional neural networks based on Alexnet, takes out the data conduct of the full articulamentum of the first two in three full articulamentums Body space initial characteristicses, the body space initial characteristicses are subjected to principal component analysis, so as to realize space conversion and dimensionality reduction, Obtain the first body image space characteristics;
    S22, inputting the first facial image space features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as facial space-time initial features, and applying principal component analysis to the facial space-time initial features to achieve space transformation and dimensionality reduction, thereby obtaining the first facial image space-time feature; inputting the first body image space features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as body space-time initial features, and applying principal component analysis to the body space-time initial features to achieve space transformation and dimensionality reduction, thereby obtaining the first body image space-time feature.
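An illustrative sketch of S21-S22 for one modality, assuming a recent torchvision AlexNet (whose classifier does contain three fully connected layers) and scikit-learn PCA; the PCA dimensionality, the choice of the last-step BLSTM output as the space-time feature, and all sizes are assumptions.

```python
# Sketch of S21 (CNN features + PCA) and S22 (BLSTM over the PCA features).
import torch
import torch.nn as nn
import torchvision.models as models
from sklearn.decomposition import PCA

alexnet = models.alexnet(weights=None).eval()   # untrained stand-in network

def fc12_features(frames):
    """Concatenated activations of the first two of AlexNet's three fully
    connected layers, for a batch of frames (N x 3 x 227 x 227)."""
    with torch.no_grad():
        x = torch.flatten(alexnet.avgpool(alexnet.features(frames)), 1)
        fc6 = alexnet.classifier[2](alexnet.classifier[1](x))    # FC layer 1 + ReLU
        fc7 = alexnet.classifier[5](alexnet.classifier[4](fc6))  # FC layer 2 + ReLU
    return torch.cat([fc6, fc7], dim=1).numpy()

frames = torch.randn(16, 3, 227, 227)            # stand-in facial image time series
# S21: space transformation and dimensionality reduction via PCA.
space_features = PCA(n_components=8).fit_transform(fc12_features(frames))

# S22: a bidirectional LSTM models the temporal dimension; here the output
# at the last time step stands in for the facial image space-time feature.
blstm = nn.LSTM(input_size=8, hidden_size=64, bidirectional=True, batch_first=True)
seq = torch.from_numpy(space_features).float().unsqueeze(0)      # 1 x T x 8
with torch.no_grad():
    out, _ = blstm(seq)
space_time_feature = out[:, -1, :].numpy()
```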
  6. The method according to any one of claims 1-5, characterised in that step S1 further comprises:
    slicing the facial image time series and the body image time series by a preset sliding window length, so as to obtain a facial image time subsequence group composed of multiple facial image time series segments and a body image time subsequence group composed of multiple body image time series segments.
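A sketch of the sliding-window slicing; the window length and stride are assumptions (the claim fixes only that the length is preset):

```python
# Cut a time series into fixed-length segments with a sliding window.
def slide(series, window, stride=1):
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

face_series = list(range(20))                  # stand-in facial image time series
subsequence_group = slide(face_series, window=8, stride=4)
# -> segments [0..7], [4..11], [8..15], [12..19]
```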
  7. The method according to claim 6, characterised in that step S2 further comprises:
    sequentially inputting the multiple facial image time series segments in the facial image time subsequence group into the Alexnet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second facial image space-time feature;
    sequentially inputting the multiple body image time series segments in the body image time subsequence group into the Alexnet-based convolutional neural network and the BLSTM-based recurrent neural network, and taking the output data as the second body image space-time feature.
  8. The method according to claim 7, characterised in that step S2 further comprises:
    inputting the multiple facial image time series segments in the facial image time subsequence group into the Alexnet-based convolutional neural network, taking the data of the first two of its three fully connected layers as second facial space initial features, and applying principal component analysis to the second facial space initial features to achieve space transformation and dimensionality reduction, thereby obtaining the second facial image space features; inputting the multiple body image time series segments in the body image time subsequence group into the Alexnet-based convolutional neural network, taking the data of the first two of its three fully connected layers as second body space initial features, and applying principal component analysis to the second body space initial features to achieve space transformation and dimensionality reduction, thereby obtaining the second body image space features;
    inputting the second facial image space features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as second facial space-time initial features, and applying principal component analysis to the second facial space-time initial features to achieve space transformation and dimensionality reduction, thereby obtaining the second facial image space-time feature; inputting the second body image space features into the BLSTM-based recurrent neural network, taking the data of the first two of its three fully connected layers as second body space-time initial features, and applying principal component analysis to the second body space-time initial features to achieve space transformation and dimensionality reduction, thereby obtaining the second body image space-time feature.
  9. The method according to claim 8, characterised in that step S3 further comprises:
    concatenating the second facial image space-time feature and the second body image space-time feature and inputting the result into the fully connected neural network, then inputting the output into a support vector machine, so as to obtain the probability matrix, over the different emotion types, of the fused second facial image space-time feature and second body image space-time feature, this probability matrix being labeled the fifth probability matrix; at the same time, inputting the concatenation of the second facial image space-time feature and the second body image space-time feature directly into a support vector machine, so as to obtain the probability matrix, over the different emotion types, of the concatenated features, this probability matrix being labeled the sixth probability matrix.
  10. The method according to claim 9, characterised in that step S4 further comprises:
    inputting the second facial image space-time feature into a support vector machine to obtain the probability matrix of the second facial image space-time feature over the different emotion types, this probability matrix being labeled the seventh probability matrix; inputting the second body image space-time feature into a support vector machine to obtain the probability matrix of the second body image space-time feature over the different emotion types, this probability matrix being labeled the eighth probability matrix; performing decision fusion on the fifth probability matrix, the sixth probability matrix, the seventh probability matrix and the eighth probability matrix to obtain a second fusion probability matrix;
    performing decision fusion on the first fusion probability matrix and the second fusion probability matrix to obtain a third fusion probability matrix, and taking the emotion type with the highest probability in the third fusion probability matrix as the emotion recognition result.
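For illustration, the two-level decision fusion of claims 9-10 composed end to end, assuming equal-weight averaging as the (unspecified) fusion rule and a hypothetical seven-class emotion label set:

```python
# Sketch of the hierarchical decision fusion: matrices 1-4 give the first
# fusion matrix, 5-8 the second, and their fusion yields the final result.
import numpy as np

def decision_fusion(*prob_matrices):
    return np.mean(np.stack(prob_matrices, axis=0), axis=0)

rng = np.random.default_rng(1)
# Stand-ins for the first..eighth probability matrices (5 clips x 7 classes).
P = [rng.dirichlet(np.ones(7), size=5) for _ in range(8)]

first_fusion = decision_fusion(*P[0:4])
second_fusion = decision_fusion(*P[4:8])
third_fusion = decision_fusion(first_fusion, second_fusion)

emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
print([emotions[i] for i in third_fusion.argmax(axis=1)])
```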
CN201711144196.1A 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method Active CN107808146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711144196.1A CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711144196.1A CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Publications (2)

Publication Number Publication Date
CN107808146A true CN107808146A (en) 2018-03-16
CN107808146B CN107808146B (en) 2020-05-05

Family

ID=61589748

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711144196.1A Active CN107808146B (en) 2017-11-17 2017-11-17 Multi-mode emotion recognition and classification method

Country Status (1)

Country Link
CN (1) CN107808146B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968643A (en) * 2012-11-16 2013-03-13 华中科技大学 Multi-mode emotion recognition method based on Lie group theory
CN106529504A (en) * 2016-12-02 2017-03-22 合肥工业大学 Dual-mode video emotion recognition method with composite spatial-temporal characteristic
CN107273876A (en) * 2017-07-18 2017-10-20 山东大学 A kind of micro- expression automatic identifying method of ' the grand micro- transformation models of to ' based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HE JUN: "Multi View Facial Action Unit Detection Based on CNN and BLSTM-RNN", 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition *
YAN Jingjie (闫静杰) et al.: "Bimodal Emotion Recognition Based on Expression and Posture" (表情和姿态的双模态情感识别), Journal of Image and Graphics (中国图象图形学报) *

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108491720A (en) * 2018-03-20 2018-09-04 腾讯科技(深圳)有限公司 A kind of application and identification method, system and relevant device
CN108491720B (en) * 2018-03-20 2023-07-14 腾讯科技(深圳)有限公司 Application identification method, system and related equipment
CN108491880A (en) * 2018-03-23 2018-09-04 西安电子科技大学 Object classification based on neural network and position and orientation estimation method
CN108596039A (en) * 2018-03-29 2018-09-28 南京邮电大学 A kind of bimodal emotion recognition method and system based on 3D convolutional neural networks
CN108596039B (en) * 2018-03-29 2020-05-05 南京邮电大学 Bimodal emotion recognition method and system based on 3D convolutional neural network
CN109101999A (en) * 2018-07-16 2018-12-28 华东师范大学 The credible decision-making technique of association's neural network based on support vector machines
CN109101999B (en) * 2018-07-16 2021-06-25 华东师范大学 Support vector machine-based cooperative neural network credible decision method
CN110795973A (en) * 2018-08-03 2020-02-14 北京大学 Multi-mode fusion action recognition method and device and computer readable storage medium
CN109190514A (en) * 2018-08-14 2019-01-11 电子科技大学 Face character recognition methods and system based on two-way shot and long term memory network
CN109190514B (en) * 2018-08-14 2021-10-01 电子科技大学 Face attribute recognition method and system based on bidirectional long-short term memory network
CN109325457A (en) * 2018-09-30 2019-02-12 合肥工业大学 Sentiment analysis method and system based on multi-channel data and Recognition with Recurrent Neural Network
CN109359599A (en) * 2018-10-19 2019-02-19 昆山杜克大学 Human facial expression recognition method based on combination learning identity and emotion information
CN109684911A (en) * 2018-10-30 2019-04-26 百度在线网络技术(北京)有限公司 Expression recognition method, device, electronic equipment and storage medium
US11151363B2 (en) 2018-10-30 2021-10-19 Baidu Online Network Technology (Beijing) Co., Ltd. Expression recognition method, apparatus, electronic device, and storage medium
CN109522945A (en) * 2018-10-31 2019-03-26 中国科学院深圳先进技术研究院 One kind of groups emotion identification method, device, smart machine and storage medium
CN109766759A (en) * 2018-12-12 2019-05-17 成都云天励飞技术有限公司 Emotion identification method and Related product
CN109783684A (en) * 2019-01-25 2019-05-21 科大讯飞股份有限公司 A kind of emotion identification method of video, device, equipment and readable storage medium storing program for executing
CN110020596A (en) * 2019-02-21 2019-07-16 北京大学 A kind of video content localization method based on Fusion Features and cascade study
CN110037693A (en) * 2019-04-24 2019-07-23 中央民族大学 A kind of mood classification method based on facial expression and EEG
CN110378335A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of information analysis method neural network based and model
CN110287912A (en) * 2019-06-28 2019-09-27 广东工业大学 Method, apparatus and medium are determined based on the target object affective state of deep learning
CN110234018A (en) * 2019-07-09 2019-09-13 腾讯科技(深圳)有限公司 Multimedia content description generation method, training method, device, equipment and medium
CN110472506A (en) * 2019-07-11 2019-11-19 广东工业大学 A kind of gesture identification method based on support vector machines and Neural Network Optimization
CN110765839B (en) * 2019-09-02 2022-02-22 合肥工业大学 Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image
CN110598608A (en) * 2019-09-02 2019-12-20 中国航天员科研训练中心 Non-contact and contact cooperative psychological and physiological state intelligent monitoring system
CN110693508A (en) * 2019-09-02 2020-01-17 中国航天员科研训练中心 Multi-channel cooperative psychophysiological active sensing method and service robot
CN110598608B (en) * 2019-09-02 2022-01-14 中国航天员科研训练中心 Non-contact and contact cooperative psychological and physiological state intelligent monitoring system
CN110765839A (en) * 2019-09-02 2020-02-07 合肥工业大学 Multi-channel information fusion and artificial intelligence emotion monitoring method for visible light facial image
CN111242155A (en) * 2019-10-08 2020-06-05 台州学院 Bimodal emotion recognition method based on multimode deep learning
CN111476217A (en) * 2020-05-27 2020-07-31 上海乂学教育科技有限公司 Intelligent learning system and method based on emotion recognition
CN111914742A (en) * 2020-07-31 2020-11-10 辽宁工业大学 Attendance checking method, system, terminal equipment and medium based on multi-mode biological characteristics
CN112418034A (en) * 2020-11-12 2021-02-26 元梦人文智能国际有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN112784730A (en) * 2021-01-20 2021-05-11 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN112784730B (en) * 2021-01-20 2022-03-29 东南大学 Multi-modal emotion recognition method based on time domain convolutional network
CN116682168A (en) * 2023-08-04 2023-09-01 阳光学院 Multi-modal expression recognition method, medium and system
CN116682168B (en) * 2023-08-04 2023-10-17 阳光学院 Multi-modal expression recognition method, medium and system
CN117351575A (en) * 2023-12-05 2024-01-05 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model
CN117351575B (en) * 2023-12-05 2024-02-27 北京师范大学珠海校区 Nonverbal behavior recognition method and nonverbal behavior recognition device based on text-generated graph data enhancement model

Also Published As

Publication number Publication date
CN107808146B (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN107808146A (en) A kind of multi-modal emotion recognition sorting technique
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN106203395B (en) Face attribute recognition method based on multitask deep learning
CN104679863B (en) It is a kind of based on deep learning to scheme to search drawing method and system
US20180150719A1 (en) Automatically computing emotions aroused from images through shape modeling
CN105999670A (en) Shadow-boxing movement judging and guiding system based on kinect and guiding method adopted by same
CN107609572A (en) Multi-modal emotion identification method, system based on neutral net and transfer learning
CN106203356B (en) A kind of face identification method based on convolutional network feature extraction
CN108764065A (en) A kind of method of pedestrian's weight identification feature fusion assisted learning
CN110464366A (en) A kind of Emotion identification method, system and storage medium
CN108491077A (en) A kind of surface electromyogram signal gesture identification method for convolutional neural networks of being divided and ruled based on multithread
CN106845513B (en) Manpower detector and method based on condition random forest
CN112906631B (en) Dangerous driving behavior detection method and detection system based on video
CN107330393A (en) A kind of neonatal pain expression recognition method based on video analysis
CN112036276A (en) Artificial intelligent video question-answering method
CN105956570A (en) Lip characteristic and deep learning based smiling face recognition method
Lyu et al. Spontaneous facial expression database of learners’ academic emotions in online learning with hand occlusion
CN115862120A (en) Separable variation self-encoder decoupled face action unit identification method and equipment
CN113486752A (en) Emotion identification method and system based on electrocardiosignals
CN106529453A (en) Reinforcement patch and multi-tag learning combination-based expression lie test method
CN108511064A (en) The system for automatically analyzing healthy data based on deep learning
KR20110098286A (en) Self health diagnosis system of oriental medicine using fuzzy inference method
CN113506274B (en) Detection system for human cognitive condition based on visual saliency difference map
Alam et al. An Autism Detection Architecture with Fusion of Feature Extraction and Classification
Javed et al. Behavior-based risk detection of autism spectrum disorder through child-robot interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant