CN109344781A - Method for expression recognition in video based on joint audio-visual features - Google Patents

Method for expression recognition in video based on joint audio-visual features

Info

Publication number
CN109344781A
Authority
CN
China
Prior art keywords
sound
sampled
video
audio
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811182972.1A
Other languages
Chinese (zh)
Inventor
张奕
谢锦滨
顾寅铮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jilian Network Technology Co Ltd
Original Assignee
Shanghai Jilian Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jilian Network Technology Co Ltd
Priority to CN201811182972.1A
Publication of CN109344781A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a method for expression recognition in video based on joint audio-visual features. The method comprises the following steps. Step S1: sample the input video in both the visual and the audio dimension to obtain sampled image frames and sampled audio segments. Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform sound feature extraction on the sampled audio segments to obtain sound feature vectors. Step S3: fuse the visual and sound feature vectors, and design a joint classifier that classifies the joint audio-visual features to obtain the expression detection and classification result.

Description

Method for expression recognition in video based on joint audio-visual features
Technical field
The present invention relates to methods for recognizing expressions in video, and more particularly to a method for expression recognition in video based on joint audio-visual features.
Background art
Expression recognition in video is the technology of judging a person's expression from the facial characteristics appearing in the video. Common and important expression categories in video include happiness, anger, disgust, fear, sadness, and surprise. Expressions are an important component of video content: by recognizing them, the emotion expressed by a video clip can be analyzed, which enables emotion-related video applications.
Most existing expression recognition technology for video is based on face visual features: a face region is located by face detection, the face region image is analyzed and recognized, and the expression is classified according to the visual features of that image. Visual features of the face region image are indeed the features that most directly reflect facial expression, but because face images suffer from interference such as blur, illumination conditions, and angular deflection, expression recognition based on the single visual modality has certain limitations. The information in a video that reflects expression is not limited to visual features, however: sound features are also important features that can reflect the emotion of a video, and the emotional attributes of a video clip can be analyzed through them, helping to improve the accuracy of expression recognition in video. How to effectively fuse visual features and sound features is the problem to be solved.
Summary of the invention
It is an object of the present invention to analyze video emotion using a sound feature model, and to jointly model sound features and visual features so as to detect and recognize the various expression categories occurring in a video. Its core is the design of a joint audio-visual multi-modal feature framework in which the modalities complement one another, making up for the deficiencies of any single feature modality.
In order to achieve the above goal, the method for expression recognition in video based on joint audio-visual features provided by the present invention is divided into the following steps:
Step S1: sample the input video in both the visual and the audio dimension to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform sound feature extraction on the sampled audio segments to obtain sound feature vectors;
Step S3: fuse the visual and sound feature vectors, and design a joint classifier that classifies the joint audio-visual features to obtain the expression detection and classification result.
Wherein, equal-interval sampling is used in both the visual and the audio dimension of the input video.
Wherein, visual features are obtained from the sampled image frames using a trained convolutional neural network whose training data are face images annotated with expression categories.
Wherein, sound features on the audio segments are obtained using a trained convolutional neural network whose training data are sound clips containing emotional speech, annotated with emotion categories.
Wherein, the visual and sound feature vectors are fused by means of a single-layer neural network: by learning the mapping weights from the visual and sound features to each expression category, the final mapping function and classification result are obtained (a minimal sketch of such a fusion layer is given below).
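As an illustration of this fusion scheme, the following minimal sketch (PyTorch) shows a single-layer network that learns mapping weights from the concatenated visual and sound features to the expression categories; the class name, feature dimensions, and class count are illustrative assumptions, not specified by the patent.

import torch
import torch.nn as nn

class JointFusionClassifier(nn.Module):
    # A single fully connected layer maps the fused features to expression classes.
    def __init__(self, visual_dim=1000, sound_dim=1000, num_classes=6):
        super().__init__()
        self.fc = nn.Linear(visual_dim + sound_dim, num_classes)

    def forward(self, visual_feat, sound_feat):
        joint = torch.cat([visual_feat, sound_feat], dim=1)  # fuse by concatenation
        return self.fc(joint)  # one score per expression category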
Advantages and technical effects of the present invention can be seen from the specific embodiment: the sound and visual information contained in a video are fully used and effectively combined through neural networks, and a joint feature model is established that compensates for the respective deficiencies of the single modalities, achieving the effect of improving the accuracy of expression recognition in video.
Description of the drawings
Fig. 1 shows the basic flow of expression recognition in video based on joint audio-visual features according to the present invention.
Specific embodiment
Each detailed problem involved in the technical solution is described below with reference to the accompanying drawing. It should be noted that the described embodiment is intended to facilitate the understanding of the present invention and does not restrict it in any way.
The implementation flow of the present invention is shown in Fig. 1:
The embodiment of the present invention first samples the video; sampling is divided into the two modalities of image and sound.
Image sampling uses equal-interval sampling every 2.56 seconds, obtaining the sample frames.
Audio sampling performs equal-interval sampling of the audio at 20-millisecond intervals, obtaining audio segments of 20 ms length (both sampling steps are sketched below).
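A minimal sketch of both sampling steps follows (Python with OpenCV and NumPy); it assumes the audio track has already been decoded elsewhere into the array wav at sample rate sr, and all names are illustrative.

import cv2
import numpy as np

FRAME_INTERVAL_S = 2.56  # image sampling interval used in the embodiment
SEGMENT_S = 0.02         # 20 ms audio segments

def sample_video(path, wav, sr):
    # Equal-interval image sampling: seek to every 2.56 s mark and grab one frame.
    cap = cv2.VideoCapture(path)
    frames, t = [], 0.0
    while True:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += FRAME_INTERVAL_S
    cap.release()
    # Equal-interval audio sampling: cut the waveform into 20 ms segments.
    seg = int(SEGMENT_S * sr)
    n = len(wav) // seg
    segments = np.reshape(wav[: n * seg], (n, seg))
    return frames, segments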
The sampled images pass through the following pre-processing step: using the method of reference [1] (Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503), the face box and facial landmarks in the image are detected and pose alignment is performed, giving the aligned face image.
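One possible implementation of this pre-processing step uses the MTCNN detector of reference [1] as packaged in facenet-pytorch; the package choice, crop size, and margin below are assumptions for illustration.

from facenet_pytorch import MTCNN
from PIL import Image

# MTCNN jointly detects the face box and facial landmarks and returns a
# cropped face tensor; image_size and margin are illustrative settings.
mtcnn = MTCNN(image_size=224, margin=20)

def aligned_face(frame_path):
    img = Image.open(frame_path).convert("RGB")
    return mtcnn(img)  # None if no face is detected in the sampled frame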
The sampled audio segments pass through the following pre-processing step: spectrum analysis is performed on each sampled audio segment and the spectrum is quantized into 128 frequency bands; every 128 sampled segments form one sample group, so each sample group covers 0.02 s x 128 = 2.56 seconds and constitutes a 128x128 spectrogram.
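The spectrogram construction can be sketched as follows (librosa); the mel scale is an assumption, since the embodiment only specifies 128 frequency bands over 128 segments of 20 ms.

import librosa

def spectrogram_128(wav, sr):
    hop = int(0.02 * sr)  # one spectral column per 20 ms audio segment
    spec = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=2 * hop, hop_length=hop, n_mels=128)
    spec = librosa.power_to_db(spec)  # log-magnitude spectrum
    return spec[:, :128]  # 128 bands x 128 columns = 0.02 s x 128 = 2.56 s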
The image convolutional neural network is trained using the annotated facial expression image dataset; the network structure is a 50-layer ResNet.
The sound convolutional neural network is trained on the annotated emotional audio dataset, whose annotation class labels correspond one-to-one to the facial expression labels of the image data; the network structure is also a 50-layer ResNet.
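A training sketch for either network is given below (PyTorch/torchvision); torchvision's ResNet-50 stands in for the 50-layer ResNet, the class count and optimizer settings are assumptions, and train_loader is assumed to yield batches of annotated (input, class label) pairs. For the sound network, the 128x128 spectrogram would be replicated to three channels or the first convolution adapted.

import torch
import torch.nn as nn
from torchvision.models import resnet50

def train_expression_net(train_loader, num_classes=6, epochs=10):
    net = resnet50(weights=None)
    net.fc = nn.Linear(net.fc.in_features, num_classes)  # expression output head
    opt = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    net.train()
    for _ in range(epochs):
        for x, y in train_loader:  # y: annotated expression/emotion class
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
    return net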
After pre-processing, the sampled image frames are input to the image convolutional neural network, and the 1000-dimensional pool5-layer output is extracted as the visual feature vector of the sampled image.
After pre-processing, the sampled audio segments are input to the sound convolutional neural network, and the 1000-dimensional pool5-layer output is extracted as the sound feature vector of the sampled audio segment.
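A feature-extraction sketch follows (PyTorch), using a forward hook to capture the pooled feature layer. Note that in torchvision's ResNet-50 the pooled feature is 2048-dimensional while the final fc output is 1000-dimensional, so which layer corresponds to the embodiment's 1000-dimensional "pool5" output is an assumption here.

import torch

@torch.no_grad()
def extract_feature(net, batch):
    # Capture the pooled feature layer of a trained ResNet via a forward hook.
    captured = {}
    hook = net.avgpool.register_forward_hook(
        lambda module, inputs, output: captured.update(
            feat=torch.flatten(output, 1)))
    net.eval()
    net(batch)     # full forward pass; the hook stores the pooled activation
    hook.remove()
    return captured["feat"]  # one feature vector per sampled frame or segment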
The visual feature vector and the sound feature vector are fused by concatenation; after dimensionality reduction to 512 dimensions by PCA (principal component analysis) and normalization, the result serves as the joint audio-visual feature vector of the sample.
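This fusion step can be sketched with scikit-learn as follows; V and A are assumed to be the stacked visual and sound feature matrices of shape (n_samples, 1000), and L2 normalization is one plausible reading of the normalization mentioned above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

def fuse_features(V, A, dim=512):
    joint = np.concatenate([V, A], axis=1)   # fuse by concatenation
    pca = PCA(n_components=dim).fit(joint)   # fit PCA on the training samples
    fused = normalize(pca.transform(joint))  # reduce to 512 dims, L2-normalize
    return fused, pca                        # reuse pca to transform new samples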
An expression classifier based on the joint audio-visual feature vectors is trained with a supervised learning method. The training samples are video clips that contain both facial expression and sound, together with annotated expression class labels. The classifier can be any common supervised learning classifier such as an SVM, XGBoost, a single-layer fully connected neural network, or a combination thereof. At inference time, feeding the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
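As an example of one of the listed classifier choices, the sketch below trains an SVM on the joint feature vectors (scikit-learn); the kernel choice is an assumption.

from sklearn.svm import SVC

def train_expression_classifier(joint_feats, labels):
    clf = SVC(kernel="rbf")       # one option among SVM, XGBoost, single-layer net
    clf.fit(joint_feats, labels)  # labels: annotated expression classes of the clips
    return clf

# Inference: feed the joint audio-visual feature vector of a new sample to the
# trained classifier to obtain its expression category, e.g.:
# predicted = clf.predict(new_joint_feats)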

Claims (9)

1. A method for expression recognition in video based on joint audio-visual features, characterized in that it comprises the following steps:
Step S1: sample the input video in both the visual and the audio dimension to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform sound feature extraction on the sampled audio segments to obtain sound feature vectors;
Step S3: fuse the visual and sound feature vectors, and design a joint classifier that classifies the joint audio-visual features to obtain the expression detection and classification result.
2. The method for expression recognition in video based on joint audio-visual features according to claim 1, characterized in that: expression recognition in video uses combined sampling of visual image frames and sound clips, and the two kinds of sampling have the same sampling interval so as to satisfy alignment in the time domain.
3. The method for expression recognition in video based on joint audio-visual features according to claim 2, characterized in that: the sound features are the feature-layer outputs obtained by inputting the equal-length, equal-interval audio spectrograms into a pre-trained sound convolutional neural network; the visual features are the feature-layer outputs obtained by inputting the sampled images, obtained by equal-interval sampling and pre-processed by face detection and alignment, into a pre-trained visual convolutional neural network; and the sound features and visual features are transformed by concatenation, dimensionality reduction, normalization and the like to obtain the joint feature vector.
4. The method for expression recognition in video based on joint audio-visual features according to claim 3, characterized in that: using a sample set with joint audio-visual annotation, a supervised classifier is trained on the extracted joint audio-visual feature vectors with the annotated expression labels, realizing expression classification in video.
5. The method for expression recognition in video based on joint audio-visual features according to claim 4, characterized in that: the expression classifier based on joint audio-visual feature vectors is trained with a supervised learning method; the training samples are video clips containing both facial expression and sound, together with annotated expression class labels; the classifier choices include, but are not limited to, an SVM, XGBoost, a single-layer fully connected neural network, or a combination thereof; and at inference time, feeding the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
6. The method for expression recognition in video based on joint audio-visual features according to claim 5, characterized in that: image sampling uses equal-interval sampling every 2.56 seconds, obtaining the sample frames; and audio sampling performs equal-interval sampling of the audio at 20-millisecond intervals, obtaining audio segments of 20 ms length.
7. The method for expression recognition in video based on joint audio-visual features according to claim 6, characterized in that: after sampling, the face box and facial landmarks in each image are detected and pose alignment is performed, giving the aligned face image; and the sampled audio segments pass through the following pre-processing step: spectrum analysis is performed on each sampled audio segment and the spectrum is quantized into 128 frequency bands, every 128 sampled segments form one sample group, each sample group covers 0.02 s x 128 = 2.56 seconds, and a 128x128 spectrogram is constituted.
8. The method for expression recognition in video based on joint audio-visual features according to claim 7, characterized in that: the image convolutional neural network is trained using the annotated facial expression image dataset, and its network structure is a 50-layer ResNet; the sound convolutional neural network is trained on the annotated emotional audio dataset, whose annotation class labels correspond one-to-one to the facial expressions in the image data, and its network structure is also a 50-layer ResNet.
9. The method for expression recognition in video based on joint audio-visual features according to claim 8, characterized in that: after pre-processing, the sampled image frames are input to the image convolutional neural network and the 1000-dimensional pool5-layer output is extracted as the visual feature vector of the sampled image; after pre-processing, the sampled audio segments are input to the sound convolutional neural network and the 1000-dimensional pool5-layer output is extracted as the sound feature vector of the sampled audio segment; and the visual feature vector and the sound feature vector are fused by concatenation and, after PCA dimensionality reduction to 512 dimensions and normalization, serve as the joint audio-visual feature vector of the sample.
CN201811182972.1A 2018-10-11 2018-10-11 Method for expression recognition in video based on joint audio-visual features Pending CN109344781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811182972.1A CN109344781A (en) 2018-10-11 2018-10-11 Method for expression recognition in video based on joint audio-visual features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811182972.1A CN109344781A (en) 2018-10-11 2018-10-11 Method for expression recognition in video based on joint audio-visual features

Publications (1)

Publication Number Publication Date
CN109344781A true CN109344781A (en) 2019-02-15

Family

ID=65309445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811182972.1A Pending CN109344781A (en) 2018-10-11 2018-10-11 Method for expression recognition in video based on joint audio-visual features

Country Status (1)

Country Link
CN (1) CN109344781A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105469065A (en) * 2015-12-07 2016-04-06 中国科学院自动化研究所 Recurrent neural network-based discrete emotion recognition method
CN105740767A (en) * 2016-01-22 2016-07-06 江苏大学 Driver road rage real-time identification and warning method based on facial features
CN106803098A (en) * 2016-12-28 2017-06-06 南京邮电大学 A kind of three mode emotion identification methods based on voice, expression and attitude
CN106878677A (en) * 2017-01-23 2017-06-20 西安电子科技大学 Student classroom Grasping level assessment system and method based on multisensor

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494477B2 (en) 2019-04-12 2022-11-08 Coupang Corp. Computerized systems and methods for determining authenticity using micro expressions
CN110363074A (en) * 2019-06-03 2019-10-22 华南理工大学 One kind identifying exchange method for complicated abstract class of things peopleization
WO2020248376A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device, and storage medium
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
TWI760671B (en) * 2019-09-27 2022-04-11 大陸商深圳市商湯科技有限公司 A kind of audio and video information processing method and device, electronic device and computer-readable storage medium
CN110717470A (en) * 2019-10-16 2020-01-21 上海极链网络科技有限公司 Scene recognition method and device, computer equipment and storage medium
CN110717470B (en) * 2019-10-16 2023-09-26 山东瑞瀚网络科技有限公司 Scene recognition method and device, computer equipment and storage medium
CN110942011A (en) * 2019-11-18 2020-03-31 上海极链网络科技有限公司 Video event identification method, system, electronic equipment and medium
CN110971969A (en) * 2019-12-09 2020-04-07 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable storage medium
CN111163366A (en) * 2019-12-30 2020-05-15 厦门市美亚柏科信息股份有限公司 Video processing method and terminal
WO2021138855A1 (en) * 2020-01-08 2021-07-15 深圳市欢太科技有限公司 Model training method, video processing method and apparatus, storage medium and electronic device
WO2021147084A1 (en) * 2020-01-23 2021-07-29 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for emotion recognition in user-generated video(ugv)
CN111401259A (en) * 2020-03-18 2020-07-10 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111401259B (en) * 2020-03-18 2024-02-02 南京星火技术有限公司 Model training method, system, computer readable medium and electronic device
CN111507421A (en) * 2020-04-22 2020-08-07 上海极链网络科技有限公司 Video-based emotion recognition method and device
CN111723239A (en) * 2020-05-11 2020-09-29 华中科技大学 Multi-mode-based video annotation method
CN111723239B (en) * 2020-05-11 2023-06-16 华中科技大学 Video annotation method based on multiple modes
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium
CN112699774B (en) * 2020-12-28 2024-05-24 深延科技(北京)有限公司 Emotion recognition method and device for characters in video, computer equipment and medium
CN114330453A (en) * 2022-01-05 2022-04-12 东北农业大学 Live pig cough sound identification method based on fusion of acoustic features and visual features
CN114581570B (en) * 2022-03-01 2024-01-26 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114581570A (en) * 2022-03-01 2022-06-03 浙江同花顺智能科技有限公司 Three-dimensional face action generation method and system
CN114581749A (en) * 2022-05-09 2022-06-03 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application
WO2023216609A1 (en) * 2022-05-09 2023-11-16 城云科技(中国)有限公司 Target behavior recognition method and apparatus based on visual-audio feature fusion, and application
CN114581749B (en) * 2022-05-09 2022-07-26 城云科技(中国)有限公司 Audio-visual feature fusion target behavior identification method and device and application

Similar Documents

Publication Publication Date Title
CN109344781A (en) Method for expression recognition in video based on joint audio-visual features
CN112699774A (en) Method and device for recognizing emotion of person in video, computer equipment and medium
CN112418172A (en) Multimode information fusion emotion analysis method based on multimode information intelligent processing unit
CN108597501A (en) An audio-visual speech model based on residual networks and bidirectional gated recurrent units
Goyal et al. Real-life facial expression recognition systems: a review
Gokilavani et al. Ravdness, crema-d, tess based algorithm for emotion recognition using speech
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
Parvini et al. An approach to glove-based gesture recognition
CN104091150B (en) A human-eye state judgment method based on regression
CN107944363A (en) Face image processing method, system and server
Marras et al. Deep multi-biometric fusion for audio-visual user re-identification and verification
Krupa et al. Emotion aware smart music recommender system using two level CNN
Sasidharan Rajeswari et al. Speech Emotion Recognition Using Machine Learning Techniques
Shrivastava et al. Puzzling out emotions: a deep-learning approach to multimodal sentiment analysis
Lungociu REAL TIME SIGN LANGUAGE RECOGNITION USING ARTIFICIAL NEURAL NETWORKS.
Sharma et al. Gesture recognition system
Kartik et al. Multimodal biometric person authentication system using speech and signature features
Tu et al. Bimodal emotion recognition based on speech signals and facial expression
Bora et al. ISL gesture recognition using multiple feature fusion
Jimoh et al. Offline gesture recognition system for yorùbá numeral counting
CN107492384B (en) Voice emotion recognition method based on fuzzy nearest neighbor algorithm
Shibata et al. Basic investigation for improvement of sign language recognition using classification scheme
Khanum et al. Emotion recognition using multi-modal features and CNN classification
Asawa et al. Recognition of emotions using energy based bimodal information fusion and correlation
Thosar et al. Review on mood detection using image processing and chatbot using artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190215