CN109344781A - Video expression recognition method based on audio-visual joint features - Google Patents
Video expression recognition method based on audio-visual joint features
- Publication number
- CN109344781A (application CN201811182972.1A)
- Authority
- CN
- China
- Prior art keywords
- sound
- sampled
- video
- audio
- expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The present invention discloses a video expression recognition method based on audio-visual joint features, comprising the following steps. Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments. Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors. Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, yielding expression detection and classification results.
Description
Technical field
The present invention relates to a method for recognizing expressions in video, and more particularly to a video expression recognition method based on audio-visual joint features.
Background technique
Expression recognition in video is the technique of judging a person's expression from the facial features appearing in the video. Common and important expression categories in video include happiness, anger, disgust, fear, sadness and surprise. Expressions are an important component of video content: by recognizing expressions, the emotions expressed in a video clip can be analyzed, enabling emotion-related video applications.
Most existing video expression recognition techniques rely on facial visual features alone: the face region is located by face detection, the face-region image is analyzed, and the expression is classified from the visual features of that region. Face-region visual features are indeed the features that best reflect facial expressions, but face images suffer from interference such as blur, lighting conditions and head-pose deflection, so expression recognition based on the single visual modality has inherent limitations. The information that reflects expressions in video is not limited to visual features, however: audio features are also important cues to a video's emotion, and analyzing the emotional attributes of a clip through its audio can help improve the accuracy of video expression recognition. How to effectively fuse visual and audio features is the problem to be solved.
Summary of the invention
The object of the present invention is to analyze video emotion using an audio feature model and to jointly model audio and visual features so as to detect and recognize the various expression categories appearing in a video. Its core is the design of an audio-visual multi-modal feature fusion framework in which the modalities complement one another, compensating for the shortcomings of any single feature modality.
To achieve the above object, the video expression recognition method based on audio-visual joint features provided by the present invention is divided into the following steps:
Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors;
Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, obtaining expression detection and classification results.
Wherein, equal-interval sampling is used in both the visual and audio dimensions of the input video.
Wherein, the visual features are obtained from the sampled image frames using a trained convolutional neural network whose training data are face images annotated with expression categories.
Wherein, the audio features of the audio segments are obtained using a trained convolutional neural network whose training data are speech segments annotated with emotion categories.
Wherein, the visual and audio feature vectors are fused using a single-layer neural network: by learning the mapping weights from the visual and audio features to each expression category, the final mapping function and classification results are obtained.
Advantages and technical effects of the present invention, as seen in the specific embodiments: the invention makes full use of the audio and visual information contained in video, combines them effectively through neural networks, and establishes a joint feature model that compensates for the respective shortcomings of the single feature modalities, thereby improving the accuracy of expression recognition in video.
Detailed description of the invention
Fig. 1 shows the basic flow of video expression recognition based on audio-visual joint features according to the present invention.
Specific embodiment
The details of the technical solution are described below with reference to the accompanying drawing. It should be noted that the described embodiments are intended to facilitate understanding of the invention and do not limit it in any way.
The implementation flow of the invention is shown in Fig. 1. The embodiment first samples the video, in two modalities: image and audio.
Image sampling uses equal intervals of 2.56 seconds to obtain sampled frames. Audio sampling divides the audio at 20-millisecond intervals, yielding audio segments of 20 ms each.
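As a rough illustration (not part of the patent), the two equal-interval sampling grids can be computed in integer milliseconds; the helper name and the millisecond convention are our assumptions:

```python
def sample_timestamps(duration_ms, frame_interval_ms=2560, audio_seg_ms=20):
    """Equal-interval sampling grids for frames and audio segments.
    Integer milliseconds avoid floating-point drift."""
    frame_times = list(range(0, duration_ms, frame_interval_ms))
    audio_starts = list(range(0, duration_ms, audio_seg_ms))
    return frame_times, audio_starts

# a 10.24 s clip: 4 sampled frames, 512 audio segments
frames_ms, segs_ms = sample_timestamps(10240)
```

Note that one 2.56 s frame interval spans exactly 128 of the 20 ms audio segments (128 × 20 ms = 2.56 s), so each sampled frame pairs naturally with one 128-segment audio group.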
The sampled images go through the following preprocessing step: the method of reference [1] (Zhang, K., Zhang, Z., Li, Z., and Qiao, Y. (2016). Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499-1503) is used to detect the face box and landmarks in each image and perform pose alignment, producing aligned face images.
The sampled audio segments go through the following preprocessing step: spectral analysis is performed on each sampled audio segment and the spectrum is quantized into 128 frequency bands; every 128 segments form one sample group (each group spans 0.02 s × 128 = 2.56 s), constituting a 128×128 spectral response map.
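A minimal numpy sketch of building such a 128×128 spectral response map; the 16 kHz sample rate and the band-averaging scheme are assumptions, since the text does not specify how the spectrum is quantized into 128 bands:

```python
import numpy as np

def spectrogram_128(signal, sr=16000, seg_ms=20, n_bands=128, n_segs=128):
    """128 consecutive 20 ms segments (2.56 s total), each reduced to
    128 frequency bands by averaging adjacent FFT magnitude bins."""
    seg_len = sr * seg_ms // 1000              # samples per 20 ms segment
    spec = np.zeros((n_segs, n_bands))
    for i in range(n_segs):
        seg = signal[i * seg_len:(i + 1) * seg_len]
        mag = np.abs(np.fft.rfft(seg))         # magnitude spectrum
        bands = np.array_split(mag, n_bands)   # quantize into 128 bands
        spec[i] = [b.mean() for b in bands]
    return spec

audio = np.random.default_rng(0).standard_normal(16000 * 3)  # 3 s of noise
spec_map = spectrogram_128(audio)              # shape (128, 128)
```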
The image convolutional neural network is trained on an annotated facial expression image dataset; the network structure is a 50-layer ResNet.
The audio convolutional neural network is trained on an annotated emotional audio dataset whose class labels correspond one-to-one with the facial expression labels in the image data; the network structure is also a 50-layer ResNet.
After preprocessing, the sampled image frames are input to the image convolutional neural network, and the 1000-dimensional output of the pool5 layer is extracted as the visual feature vector of the sampled image. After preprocessing, the sampled audio segments are input to the audio convolutional neural network, and the 1000-dimensional output of the pool5 layer is extracted as the audio feature vector of the segment.
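To illustrate what "extract the pool5 output as a feature vector" means, here is a toy numpy stand-in for the final stage of such a network (one convolution, ReLU, global average pooling). The real method uses a trained 50-layer ResNet, whose architecture and weights are of course not reproduced here; everything below is an illustrative assumption:

```python
import numpy as np

def extract_features(x, conv_w):
    """Toy feature extractor: 3x3 valid convolution, ReLU, then global
    average pooling, yielding one value per output channel -- a stand-in
    for reading a pooled feature layer of a trained CNN."""
    c_out, c_in, k, _ = conv_w.shape
    h, w = x.shape[1] - k + 1, x.shape[2] - k + 1
    maps = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    maps[o] += conv_w[o, i, dy, dx] * x[i, dy:dy + h, dx:dx + w]
    maps = np.maximum(maps, 0.0)               # ReLU
    return maps.mean(axis=(1, 2))              # global average pool

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 16, 16))                # toy aligned face crop
weights = rng.standard_normal((1000, 3, 3, 3)) * 0.1  # toy weights
feat = extract_features(img, weights)                 # 1000-dim feature vector
```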
The visual and audio feature vectors are concatenated; after PCA dimensionality reduction to 512 dimensions and normalization, the result serves as the joint audio-visual feature vector of the sample.
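The fusion step (concatenate, reduce to 512 dimensions, normalize) can be sketched as below; a random orthonormal matrix stands in for a PCA basis fitted on training data, which this sketch does not include:

```python
import numpy as np

def fuse_features(visual, audio_feat, basis):
    """Concatenate the two 1000-dim vectors, project to 512 dims with the
    given (PCA-like) orthonormal basis, then L2-normalize."""
    joint = np.concatenate([visual, audio_feat])   # 2000-dim
    z = joint @ basis                              # reduce to 512 dims
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((2000, 512)))  # stand-in PCA basis
v = rng.standard_normal(1000)                    # visual feature vector
a = rng.standard_normal(1000)                    # audio feature vector
joint_vec = fuse_features(v, a, basis)           # 512-dim unit vector
```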
An expression classifier based on the joint audio-visual feature vector is trained by supervised learning. The training samples are video clips containing both facial expressions and audio, together with annotated expression class labels. The classifier may be any common supervised classifier such as an SVM, XGBoost, a single-layer fully connected network, or a combination thereof. At inference time, inputting the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
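Of the classifier options listed (SVM, XGBoost, single-layer fully connected network), the simplest to show is a single-layer softmax classifier trained by gradient descent; the toy clusters below are our own and only demonstrate the training/inference pattern, not the patent's data:

```python
import numpy as np

def train_softmax(X, y, n_classes, lr=0.5, epochs=200):
    """Single-layer fully connected classifier with softmax output,
    trained by batch gradient descent on cross-entropy."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)        # softmax probabilities
        W -= lr * X.T @ (p - onehot) / len(X)    # gradient step
    return W

rng = np.random.default_rng(0)
# toy "joint feature" clusters standing in for two expression classes
X = np.vstack([rng.normal(-1, 0.3, (50, 8)), rng.normal(1, 0.3, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W = train_softmax(X, y, n_classes=2)
pred = (X @ W).argmax(axis=1)                    # inference: argmax of logits
train_acc = (pred == y).mean()
```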
Claims (9)
1. A video expression recognition method based on audio-visual joint features, characterized by comprising the following steps:
Step S1: sample the input video in both the visual and audio dimensions to obtain sampled image frames and sampled audio segments;
Step S2: perform visual feature extraction on the sampled image frames to obtain visual feature vectors, and perform audio feature extraction on the sampled audio segments to obtain audio feature vectors;
Step S3: fuse the visual and audio feature vectors and design a joint classifier that classifies the joint audio-visual features, obtaining expression detection and classification results.
2. The video expression recognition method based on audio-visual joint features according to claim 1, characterized in that: expression recognition in the video uses joint sampling of visual image frames and audio segments, the two kinds of sampling having the same sampling interval so as to be aligned in the time domain.
3. The video expression recognition method based on audio-visual joint features according to claim 2, characterized in that: the audio features are the feature-layer outputs obtained by feeding equal-interval audio spectrograms into a pre-trained audio convolutional neural network; the visual features are the feature-layer outputs obtained by feeding the equal-interval sampled images, after face detection and alignment preprocessing, into a pre-trained visual convolutional neural network; the audio and visual features are fused by concatenation followed by transformations such as dimensionality reduction and normalization to obtain the joint feature vector.
4. The video expression recognition method based on audio-visual joint features according to claim 3, characterized in that: using a sample set with joint audio-visual annotations, a supervised classifier is trained on the extracted joint audio-visual feature vectors with the annotated expression labels, realizing expression classification in video.
5. The video expression recognition method based on audio-visual joint features according to claim 4, characterized in that: an expression classifier based on the joint audio-visual feature vector is trained by supervised learning; the training samples are video clips containing both facial expressions and audio, together with annotated expression class labels; the classifier includes but is not limited to an SVM, XGBoost, a single-layer fully connected network, or a combination thereof; at inference time, inputting the joint audio-visual feature vector of a sample into the classifier yields the expression category of that sample.
6. The video expression recognition method based on audio-visual joint features according to claim 5, characterized in that: image sampling uses equal intervals of 2.56 seconds to obtain sampled frames; audio sampling divides the audio at 20-millisecond intervals, yielding audio segments of 20 ms each.
7. The video expression recognition method based on audio-visual joint features according to claim 6, characterized in that: after image sampling, the face box and landmarks are detected in each image and pose alignment is performed, producing aligned face images; the sampled audio segments go through the following preprocessing step: spectral analysis is performed on each sampled audio segment, the spectrum is quantized into 128 frequency bands, and every 128 segments form one sample group (each group spans 0.02 s × 128 = 2.56 s), constituting a 128×128 spectral response map.
8. The video expression recognition method based on audio-visual joint features according to claim 7, characterized in that: the image convolutional neural network is trained on an annotated facial expression image dataset, the network structure being a 50-layer ResNet; the audio convolutional neural network is trained on an annotated emotional audio dataset whose class labels correspond one-to-one with the facial expression labels in the image data, the network structure also being a 50-layer ResNet.
9. The video expression recognition method based on audio-visual joint features according to claim 8, characterized in that: after preprocessing, the sampled image frames are input to the image convolutional neural network and the 1000-dimensional pool5 output is extracted as the visual feature vector of the sampled image; after preprocessing, the sampled audio segments are input to the audio convolutional neural network and the 1000-dimensional pool5 output is extracted as the audio feature vector of the segment; the visual and audio feature vectors are concatenated, reduced to 512 dimensions by PCA and normalized, serving as the joint audio-visual feature vector of the sample.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811182972.1A CN109344781A (en) | 2018-10-11 | 2018-10-11 | Expression recognition method in a kind of video based on audio visual union feature |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109344781A true CN109344781A (en) | 2019-02-15 |
Family
ID=65309445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811182972.1A Pending CN109344781A (en) | 2018-10-11 | 2018-10-11 | Expression recognition method in a kind of video based on audio visual union feature |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109344781A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105469065A (en) * | 2015-12-07 | 2016-04-06 | 中国科学院自动化研究所 | Recurrent neural network-based discrete emotion recognition method |
CN105740767A (en) * | 2016-01-22 | 2016-07-06 | 江苏大学 | Driver road rage real-time identification and warning method based on facial features |
CN106803098A (en) * | 2016-12-28 | 2017-06-06 | 南京邮电大学 | A kind of three mode emotion identification methods based on voice, expression and attitude |
CN106878677A (en) * | 2017-01-23 | 2017-06-20 | 西安电子科技大学 | Student classroom Grasping level assessment system and method based on multisensor |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11494477B2 (en) | 2019-04-12 | 2022-11-08 | Coupang Corp. | Computerized systems and methods for determining authenticity using micro expressions |
CN110363074A (en) * | 2019-06-03 | 2019-10-22 | 华南理工大学 | Humanized recognition and interaction method for complex abstract things |
WO2020248376A1 (en) * | 2019-06-14 | 2020-12-17 | 平安科技(深圳)有限公司 | Emotion detection method and apparatus, electronic device, and storage medium |
CN112328830A (en) * | 2019-08-05 | 2021-02-05 | Tcl集团股份有限公司 | Information positioning method based on deep learning and related equipment |
TWI760671B (en) * | 2019-09-27 | 2022-04-11 | 大陸商深圳市商湯科技有限公司 | A kind of audio and video information processing method and device, electronic device and computer-readable storage medium |
CN110717470A (en) * | 2019-10-16 | 2020-01-21 | 上海极链网络科技有限公司 | Scene recognition method and device, computer equipment and storage medium |
CN110717470B (en) * | 2019-10-16 | 2023-09-26 | 山东瑞瀚网络科技有限公司 | Scene recognition method and device, computer equipment and storage medium |
CN110942011A (en) * | 2019-11-18 | 2020-03-31 | 上海极链网络科技有限公司 | Video event identification method, system, electronic equipment and medium |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | 北京字节跳动网络技术有限公司 | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
WO2021138855A1 (en) * | 2020-01-08 | 2021-07-15 | 深圳市欢太科技有限公司 | Model training method, video processing method and apparatus, storage medium and electronic device |
WO2021147084A1 (en) * | 2020-01-23 | 2021-07-29 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for emotion recognition in user-generated video(ugv) |
CN111401259A (en) * | 2020-03-18 | 2020-07-10 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111401259B (en) * | 2020-03-18 | 2024-02-02 | 南京星火技术有限公司 | Model training method, system, computer readable medium and electronic device |
CN111507421A (en) * | 2020-04-22 | 2020-08-07 | 上海极链网络科技有限公司 | Video-based emotion recognition method and device |
CN111723239A (en) * | 2020-05-11 | 2020-09-29 | 华中科技大学 | Multi-mode-based video annotation method |
CN111723239B (en) * | 2020-05-11 | 2023-06-16 | 华中科技大学 | Video annotation method based on multiple modes |
CN112699774A (en) * | 2020-12-28 | 2021-04-23 | 深延科技(北京)有限公司 | Method and device for recognizing emotion of person in video, computer equipment and medium |
CN112699774B (en) * | 2020-12-28 | 2024-05-24 | 深延科技(北京)有限公司 | Emotion recognition method and device for characters in video, computer equipment and medium |
CN114330453A (en) * | 2022-01-05 | 2022-04-12 | 东北农业大学 | Live pig cough sound identification method based on fusion of acoustic features and visual features |
CN114581570B (en) * | 2022-03-01 | 2024-01-26 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN114581570A (en) * | 2022-03-01 | 2022-06-03 | 浙江同花顺智能科技有限公司 | Three-dimensional face action generation method and system |
CN114581749A (en) * | 2022-05-09 | 2022-06-03 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
WO2023216609A1 (en) * | 2022-05-09 | 2023-11-16 | 城云科技(中国)有限公司 | Target behavior recognition method and apparatus based on visual-audio feature fusion, and application |
CN114581749B (en) * | 2022-05-09 | 2022-07-26 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109344781A (en) | Video expression recognition method based on audio-visual joint features | |
CN112699774A (en) | Method and device for recognizing emotion of person in video, computer equipment and medium | |
CN112418172A (en) | Multimode information fusion emotion analysis method based on multimode information intelligent processing unit | |
CN108597501A (en) | Audio-visual speech model based on residual networks and bidirectional gated recurrent units | |
Goyal et al. | Real-life facial expression recognition systems: a review | |
Gokilavani et al. | Ravdness, crema-d, tess based algorithm for emotion recognition using speech | |
Shinde et al. | Real time two way communication approach for hearing impaired and dumb person based on image processing | |
Parvini et al. | An approach to glove-based gesture recognition | |
CN104091150B (en) | A kind of human eye state judgment method based on recurrence | |
CN107944363A (en) | Face image processing process, system and server | |
Marras et al. | Deep multi-biometric fusion for audio-visual user re-identification and verification | |
Krupa et al. | Emotion aware smart music recommender system using two level CNN | |
Sasidharan Rajeswari et al. | Speech Emotion Recognition Using Machine Learning Techniques | |
Shrivastava et al. | Puzzling out emotions: a deep-learning approach to multimodal sentiment analysis | |
Lungociu | REAL TIME SIGN LANGUAGE RECOGNITION USING ARTIFICIAL NEURAL NETWORKS. | |
Sharma et al. | Gesture recognition system | |
Kartik et al. | Multimodal biometric person authentication system using speech and signature features | |
Tu et al. | Bimodal emotion recognition based on speech signals and facial expression | |
Bora et al. | ISL gesture recognition using multiple feature fusion | |
Jimoh et al. | Offline gesture recognition system for yorùbá numeral counting | |
CN107492384B (en) | Voice emotion recognition method based on fuzzy nearest neighbor algorithm | |
Shibata et al. | Basic investigation for improvement of sign language recognition using classification scheme | |
Khanum et al. | Emotion recognition using multi-modal features and CNN classification | |
Asawa et al. | Recognition of emotions using energy based bimodal information fusion and correlation | |
Thosar et al. | Review on mood detection using image processing and chatbot using artificial intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190215 ||