CN109344781A - Video expression recognition method based on audio-visual joint features - Google Patents
Video expression recognition method based on audio-visual joint features
- Publication number
- CN109344781A (application CN201811182972.1A)
- Authority
- CN
- China
- Prior art keywords
- sound
- sampled
- video
- audio
- expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a video expression recognition method based on audio-visual joint features, comprising the following steps. Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments. Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors. Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, yielding expression detection and classification results.
Description
Technical field
The present invention relates to a method for recognizing expressions in video, and more particularly to a video expression recognition method based on audio-visual joint features.
Background technique
Expression recognition in video is the technique of judging a person's expression from the facial features appearing in the video. Common and important expression categories in video include happiness, anger, disgust, fear, sadness and surprise. Expressions are an important component of video content: by recognizing expressions, the emotions expressed in a video clip can be analyzed, enabling emotion-related video applications.
Most existing video expression recognition techniques rely on facial visual features alone: the face region is located by face detection, the face-region image is analyzed, and the expression is classified from the visual features of that region. Face-region visual features are indeed the features that best reflect facial expressions, but face images suffer from interference such as blur, lighting conditions and head-pose deflection, so expression recognition based on the single visual modality has inherent limitations. The information that reflects expressions in video is not limited to visual features, however: audio features are also important cues to a video's emotion, and analyzing the emotional attributes of a clip through its audio can help improve the accuracy of video expression recognition. How to effectively fuse visual and audio features is the problem to be solved.
Summary of the invention
The object of the present invention is to analyze video emotion using an audio feature model and to jointly model audio and visual features so as to detect and recognize the various expression categories appearing in a video. Its core is the design of an audio-visual multi-modal feature fusion framework in which the modalities complement one another, compensating for the shortcomings of any single feature modality.
To achieve the above object, the video expression recognition method based on audio-visual joint features provided by the present invention is divided into the following steps:
Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors;
Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, obtaining expression detection and classification results.
Wherein, equal-interval sampling is used in both the visual and audio dimensions of the input video.
Wherein, the visual features are obtained from the sampled image frames using a trained convolutional neural network whose training data are face images annotated with expression categories.
Wherein, the audio features of the audio segments are obtained using a trained convolutional neural network whose training data are speech segments annotated with emotion categories.
Wherein, the visual and audio feature vectors are fused using a single-layer neural network: by learning the mapping weights from the visual and audio features to each expression category, the final mapping function and classification results are obtained.
Advantages and technical effects of the present invention, as seen in the specific embodiments: the invention makes full use of the audio and visual information contained in video, combines them effectively through neural networks, and establishes a joint feature model that compensates for the respective shortcomings of the single feature modalities, thereby improving the accuracy of expression recognition in video.
Detailed description of the invention
Fig. 1 shows the basic flow of video expression recognition based on audio-visual joint features according to the present invention.
Specific embodiment
The details of the technical solution are described below with reference to the accompanying drawing. It should be noted that the described embodiments are intended to facilitate understanding of the invention and do not limit it in any way.
The implementation flow of the invention is shown in Fig. 1. The embodiment first samples the video, in two modalities: image and audio.
Image sampling uses equal intervals of 2.56 seconds to obtain sampled frames. Audio sampling divides the audio at 20-millisecond intervals, yielding audio segments of 20 ms each.
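As a rough illustration (not part of the patent), the two equal-interval sampling grids can be computed in integer milliseconds; the helper name and the millisecond convention are our assumptions:

```python
def sample_timestamps(duration_ms, frame_interval_ms=2560, audio_seg_ms=20):
    """Equal-interval sampling grids for frames and audio segments.
    Integer milliseconds avoid floating-point drift."""
    frame_times = list(range(0, duration_ms, frame_interval_ms))
    audio_starts = list(range(0, duration_ms, audio_seg_ms))
    return frame_times, audio_starts

# a 10.24 s clip: 4 sampled frames, 512 audio segments
frames_ms, segs_ms = sample_timestamps(10240)
```

Note that one 2.56 s frame interval spans exactly 128 of the 20 ms audio segments (128 × 20 ms = 2.56 s), so each sampled frame pairs naturally with one 128-segment audio group.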
The sampled images go through the following preprocessing step: the method of reference [1] (Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503) is used to detect the face box and landmarks in each image and perform pose alignment, producing aligned face images.
The sampled audio segments go through the following preprocessing step: spectral analysis is performed on each sampled audio segment and the spectrum is quantized into 128 frequency bands; every 128 segments form one sample group (each group spans 0.02 s × 128 = 2.56 s), constituting a 128×128 spectral response map.
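A minimal numpy sketch of building such a 128×128 spectral response map; the 16 kHz sample rate and the band-averaging scheme are assumptions, since the text does not specify how the spectrum is quantized into 128 bands:

```python
import numpy as np

def spectrogram_128(signal, sr=16000, seg_ms=20, n_bands=128, n_segs=128):
    """128 consecutive 20 ms segments (2.56 s total), each reduced to
    128 frequency bands by averaging adjacent FFT magnitude bins."""
    seg_len = sr * seg_ms // 1000              # samples per 20 ms segment
    spec = np.zeros((n_segs, n_bands))
    for i in range(n_segs):
        seg = signal[i * seg_len:(i + 1) * seg_len]
        mag = np.abs(np.fft.rfft(seg))         # magnitude spectrum
        bands = np.array_split(mag, n_bands)   # quantize into 128 bands
        spec[i] = [b.mean() for b in bands]
    return spec

audio = np.random.default_rng(0).standard_normal(16000 * 3)  # 3 s of noise
spec_map = spectrogram_128(audio)              # shape (128, 128)
```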
The image convolutional neural network is trained on an annotated facial expression image dataset; the network structure is a 50-layer ResNet.
The audio convolutional neural network is trained on an annotated emotional audio dataset whose class labels correspond one-to-one with the facial expression labels in the image data; the network structure is also a 50-layer ResNet.
After preprocessing, the sampled image frames are input to the image convolutional neural network, and the 1000-dimensional output of the pool5 layer is extracted as the visual feature vector of the sampled image. After preprocessing, the sampled audio segments are input to the audio convolutional neural network, and the 1000-dimensional output of the pool5 layer is extracted as the audio feature vector of the segment.
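To illustrate what "extract the pool5 output as a feature vector" means, here is a toy numpy stand-in for the final stage of such a network (one convolution, ReLU, global average pooling). The real method uses a trained 50-layer ResNet, whose architecture and weights are of course not reproduced here; everything below is an illustrative assumption:

```python
import numpy as np

def extract_features(x, conv_w):
    """Toy feature extractor: 3x3 valid convolution, ReLU, then global
    average pooling, yielding one value per output channel -- a stand-in
    for reading a pooled feature layer of a trained CNN."""
    c_out, c_in, k, _ = conv_w.shape
    h, w = x.shape[1] - k + 1, x.shape[2] - k + 1
    maps = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    maps[o] += conv_w[o, i, dy, dx] * x[i, dy:dy + h, dx:dx + w]
    maps = np.maximum(maps, 0.0)               # ReLU
    return maps.mean(axis=(1, 2))              # global average pool

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 16, 16))                # toy aligned face crop
weights = rng.standard_normal((1000, 3, 3, 3)) * 0.1  # toy weights
feat = extract_features(img, weights)                 # 1000-dim feature vector
```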
The visual and audio feature vectors are concatenated; after PCA dimensionality reduction to 512 dimensions and normalization, the result serves as the joint audio-visual feature vector of the sample.
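The fusion step (concatenate, reduce to 512 dimensions, normalize) can be sketched as below; a random orthonormal matrix stands in for a PCA basis fitted on training data, which this sketch does not include:

```python
import numpy as np

def fuse_features(visual, audio_feat, basis):
    """Concatenate the two 1000-dim vectors, project to 512 dims with the
    given (PCA-like) orthonormal basis, then L2-normalize."""
    joint = np.concatenate([visual, audio_feat])   # 2000-dim
    z = joint @ basis                              # reduce to 512 dims
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((2000, 512)))  # stand-in PCA basis
v = rng.standard_normal(1000)                    # visual feature vector
a = rng.standard_normal(1000)                    # audio feature vector
joint_vec = fuse_features(v, a, basis)           # 512-dim unit vector
```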
An expression classifier based on the joint audio-visual feature vector is trained by supervised learning. The training samples are video clips containing both facial expressions and audio, together with annotated expression class labels. The classifier may be any common supervised classifier such as an SVM, XGBoost, a single-layer fully connected network, or a combination thereof. At inference time, inputting the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
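Of the classifier options listed (SVM, XGBoost, single-layer fully connected network), the simplest to show is a single-layer softmax classifier trained by gradient descent; the toy clusters below are our own and only demonstrate the training/inference pattern, not the patent's data:

```python
import numpy as np

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Single-layer fully connected classifier with softmax output,
    trained by batch gradient descent on cross-entropy."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
        W -= lr * X.T @ (p - onehot) / len(X)    # gradient step
    return W

rng = np.random.default_rng(0)
# toy "joint feature" clusters standing in for two expression classes
X = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W = train_softmax(X, y, n_classes=2)
pred = (X @ W).argmax(axis=1)                    # inference: argmax of logits
train_acc = (pred == y).mean()
```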
Claims (9)
1. A video expression recognition method based on audio-visual joint features, characterized by comprising the following steps:
Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors;
Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, obtaining expression detection and classification results.
2. The video expression recognition method based on audio-visual joint features according to claim 1, characterized in that: expression recognition in the video uses joint sampling of visual image frames and audio segments, the two kinds of sampling having the same sampling interval so as to be aligned in the time domain.
3. The video expression recognition method based on audio-visual joint features according to claim 2, characterized in that: the audio features are the feature-layer outputs obtained by feeding equal-interval audio spectrograms into a pre-trained audio convolutional neural network; the visual features are the feature-layer outputs obtained by feeding the equal-interval sampled images, after face detection and alignment preprocessing, into a pre-trained visual convolutional neural network; the audio and visual features are fused by concatenation followed by transformations such as dimensionality reduction and normalization to obtain the joint feature vector.
4. The video expression recognition method based on audio-visual joint features according to claim 3, characterized in that: using a sample set with joint audio-visual annotations, a supervised classifier is trained on the extracted joint audio-visual feature vectors with the annotated expression labels, realizing expression classification in video.
5. The video expression recognition method based on audio-visual joint features according to claim 4, characterized in that: an expression classifier based on the joint audio-visual feature vector is trained by supervised learning; the training samples are video clips containing both facial expressions and audio, together with annotated expression class labels; the classifier includes but is not limited to an SVM, XGBoost, a single-layer fully connected network, or a combination thereof; at inference time, inputting the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
6. The video expression recognition method based on audio-visual joint features according to claim 5, characterized in that: image sampling uses equal intervals of 2.56 seconds to obtain sampled frames; audio sampling divides the audio at 20-millisecond intervals, yielding audio segments of 20 ms each.
7. The video expression recognition method based on audio-visual joint features according to claim 6, characterized in that: after image sampling, the face box and landmarks are detected in each image and pose alignment is performed, producing aligned face images; the sampled audio segments go through the following preprocessing step: spectral analysis is performed on each sampled audio segment, the spectrum is quantized into 128 frequency bands, and every 128 segments form one sample group (each group spans 0.02 s × 128 = 2.56 s), constituting a 128×128 spectral response map.
8. The video expression recognition method based on audio-visual joint features according to claim 7, characterized in that: the image convolutional neural network is trained on an annotated facial expression image dataset, the network structure being a 50-layer ResNet; the audio convolutional neural network is trained on an annotated emotional audio dataset whose class labels correspond one-to-one with the facial expression labels in the image data, the network structure also being a 50-layer ResNet.
9. The video expression recognition method based on audio-visual joint features according to claim 8, characterized in that: after preprocessing, the sampled image frames are input to the image convolutional neural network and the 1000-dimensional pool5 output is extracted as the visual feature vector of the sampled image; after preprocessing, the sampled audio segments are input to the audio convolutional neural network and the 1000-dimensional pool5 output is extracted as the audio feature vector of the segment; the visual and audio feature vectors are concatenated, reduced to 512 dimensions by PCA and normalized, serving as the joint audio-visual feature vector of the sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811182972.1A CN109344781A (en) | 2018-10-11 | 2018-10-11 | Expression recognition method in a kind of video based on audio visual union feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109344781A true CN109344781A (en) | 2019-02-15 |
Family
ID=65309445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811182972.1A Pending CN109344781A (en) | 2018-10-11 | 2018-10-11 | Expression recognition method in a kind of video based on audio visual union feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344781A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469065A (en) * | 2015-12-07 | 2016-04-06 | 中国科学院自动化研究所 | Recurrent neural network-based discrete emotion recognition method |
CN105740767A (en) * | 2016-01-22 | 2016-07-06 | 江苏大学 | Driver road rage real-time identification and warning method based on facial features |
CN106803098A (en) * | 2016-12-28 | 2017-06-06 | 南京邮电大学 | A kind of three mode emotion identification methods based on voice, expression and attitude |
CN106878677A (en) * | 2017-01-23 | 2017-06-20 | 西安电子科技大学 | Student classroom Grasping level assessment system and method based on multisensor |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11494477B2 (en) | 2019-04-12 | 2022-11-08 | Coupang Corp. | Computerized systems and methods for determining authenticity using micro expressions |
CN110363074A (en) * | 2019-06-03 | 2019-10-22 | 华南理工大学 | Humanized recognition and interaction method for complex abstract things |
WO2020248376A1 (en) * | 2019-06-14 | 2020-12-17 | 平安科技(深圳)有限公司 | Emotion detection method and apparatus, electronic device, and storage medium |
CN112328830A (en) * | 2019-08-05 | 2021-02-05 | Tcl集团股份有限公司 | Information positioning method based on deep learning and related equipment |
TWI760671B (en) * | 2019-09-27 | 2022-04-11 | 大陸商深圳市商湯科技有限公司 | A kind of audio and video information processing method and device, electronic device and computer-readable storage medium |
CN110717470A (en) * | 2019-10-16 | 2020-01-21 | 上海极链网络科技有限公司 | Scene recognition method and device, computer equipment and storage medium |
CN110717470B (en) * | 2019-10-16 | 2023-09-26 | 山东瑞瀚网络科技有限公司 | Scene recognition method and device, computer equipment and storage medium |
CN110942011A (en) * | 2019-11-18 | 2020-03-31 | 上海极链网络科技有限公司 | Video event identification method, system, electronic equipment and medium |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | 北京字节跳动网络技术有限公司 | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
WO2021138855A1 (en) * | 2020-01-08 | 2021-07-15 | 深圳市欢太科技有限公司 | Model training method, video processing method and apparatus, storage medium and electronic device |
WO2021147084A1 (en) * | 2020-01-23 | 2021-07-29 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for emotion recognition in user-generated video(ugv) |
CN111401259A (en) * | 2020-03-18 | 2020-07-10 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111401259B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111507421A (en) * | 2020-04-22 | 2020-08-07 | 上海极链网络科技有限公司 | Video-based emotion recognition method and device |
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
CN112699774A (en) * | 2020-12-28 | 2021-04-23 | 深延科技(北京)有限公司 | Method and device for recognizing emotion of person in video, computer equipment and medium |
CN112699774B (en) * | 2020-12-28 | 2024-05-24 | 深延科技(北京)有限公司 | Emotion recognition method and device for characters in video, computer equipment and medium |
CN114330453A (en) * | 2022-01-05 | 2022-04-12 | 东北农业大学 | Live pig cough sound identification method based on fusion of acoustic features and visual features |
CN114581570B (en) * | 2022-03-01 | 2024-01-26 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN114581570A (en) * | 2022-03-01 | 2022-06-03 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN114581749A (en) * | 2022-05-09 | 2022-06-03 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
WO2023216609A1 (en) * | 2022-05-09 | 2023-11-16 | 城云科技(中国)有限公司 | Target behavior recognition method and apparatus based on visual-audio feature fusion, and application |
CN114581749B (en) * | 2022-05-09 | 2022-07-26 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344781A (en) | Video expression recognition method based on audio-visual joint features | |
CN112699774A (en) | Method and device for recognizing emotion of person in video, computer equipment and medium | |
CN112418172A (en) | Multimode information fusion emotion analysis method based on multimode information intelligent processing unit | |
CN108597501A (en) | Audio-visual speech model based on residual networks and bidirectional gated recurrent units | |
Goyal et al. | Real-life facial expression recognition systems: a review | |
Gokilavani et al. | Ravdness, crema-d, tess based algorithm for emotion recognition using speech | |
Shinde et al. | Real time two way communication approach for hearing impaired and dumb person based on image processing | |
Parvini et al. | An approach to glove-based gesture recognition | |
CN104091150B (en) | A kind of human eye state judgment method based on recurrence | |
CN107944363A (en) | Face image processing process, system and server | |
Marras et al. | Deep multi-biometric fusion for audio-visual user re-identification and verification | |
Krupa et al. | Emotion aware smart music recommender system using two level CNN | |
Sasidharan Rajeswari et al. | Speech Emotion Recognition Using Machine Learning Techniques | |
Shrivastava et al. | Puzzling out emotions: a deep-learning approach to multimodal sentiment analysis | |
Lungociu | REAL TIME SIGN LANGUAGE RECOGNITION USING ARTIFICIAL NEURAL NETWORKS. | |
Sharma et al. | Gesture recognition system | |
Kartik et al. | Multimodal biometric person authentication system using speech and signature features | |
Tu et al. | Bimodal emotion recognition based on speech signals and facial expression | |
Bora et al. | ISL gesture recognition using multiple feature fusion | |
Jimoh et al. | Offline gesture recognition system for yorùbá numeral counting | |
CN107492384B (en) | Voice emotion recognition method based on fuzzy nearest neighbor algorithm | |
Shibata et al. | Basic investigation for improvement of sign language recognition using classification scheme | |
Khanum et al. | Emotion recognition using multi-modal features and CNN classification | |
Asawa et al. | Recognition of emotions using energy based bimodal information fusion and correlation | |
Thosar et al. | Review on mood detection using image processing and chatbot using artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190215 ||