CN109409296A - Video emotion recognition method fusing facial expression recognition and speech emotion recognition - Google Patents
Video emotion recognition method fusing facial expression recognition and speech emotion recognition
- Publication number
- CN109409296A (application CN201811272233.1A)
- Authority
- CN
- China
- Prior art keywords
- formula
- recognition
- feature
- point
- mentioned
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The present invention is a video emotion recognition method that fuses facial expression recognition and speech emotion recognition. It relates to the processing of record carriers for recognizing figures, and is an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The method separates the facial expression recognition and the speech emotion recognition in a video and recognizes emotion in two progressive processes: by computing conditional probabilities, speech emotion recognition is carried out on the basis of the facial expression recognition result. The steps are: Process A, facial-image expression recognition as the first classification; Process B, speech emotion recognition as the second classification; Process C, fusion of facial expression recognition and speech emotion recognition. The invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and achieves low recognition rates.
Description
Technical field
The technical solution of the present invention relates to the processing of record carriers for recognizing figures, specifically to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background technique
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology advances daily, and computer-based human emotion recognition has received widespread attention. How to make computers recognize human emotions faster and more accurately has become a current research hotspot in the field of machine vision.
Humans express emotion in diverse ways, mainly through facial expressions, emotional speech, upper-body posture and language. Among these, facial expression and emotional speech are the two most typical modes. Because the texture and geometric features of the face are relatively easy to extract, emotion recognition methods based on facial expression have already reached fairly high recognition rates. However, for similar expressions, such as anger and disgust, or fear and surprise, the texture and geometric features are close to each other, and methods that rely only on extracted facial-expression features achieve low recognition rates.
Single-modality emotion recognition methods therefore have inherent limitations, and bimodal or multimodal emotion recognition has increasingly become a hotspot of research and attention in the field of emotion recognition. The key to multimodal emotion recognition is how the modalities are fused; the mainstream fusion modes are feature-level fusion and decision-level fusion.
In 2012, in the paper "AVEC: the continuous audio/visual emotion challenge", Schuller et al. cascaded audio and video features into a single feature vector and used support vector regression (SVR) as the baseline of the AVEC 2012 challenge; this feature-level fusion method directly concatenates the multimodal features into a joint feature vector. Because the number of multimodal features is huge, this easily causes the curse of dimensionality, and high-dimensional features are highly susceptible to data sparsity. Considering the interaction between features, the advantage of combining audio and video features under feature-level fusion is therefore limited.
Decision-level fusion means that each emotional expression mode is first modeled by its own classifier, and the recognition results of the individual classifiers are then fused; without increasing the dimensionality, the different modes are combined through the contribution of each emotional expression. In the paper "A combined rule-based & machine learning audio-visual emotion recognition approach", Seng et al. split audiovisual emotion recognition into two mutually independent paths that extract features separately, model each on its own classifier, obtain the corresponding recognition rates, and finally derive the overall recognition rate from ratio scoring and the corresponding weight distribution. Existing decision-level fusion has two main shortcomings. First, ratio scoring and weight-distribution strategies lack a unified authoritative standard: different researchers, using a variety of ratio scores and weight-distribution strategies, often obtain different recognition results on the same research project. Second, decision-level fusion emphasizes the fusion of face recognition and speech recognition results while ignoring the inner link between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method based on compound spatio-temporal features. It extends the existing local binary pattern algorithm to spatio-temporal local ternary patterns, obtains spatio-temporal local ternary pattern texture features of facial expression and upper-body posture, and further fuses three-dimensional histogram-of-oriented-gradients features to strengthen the description of the emotion video, combining the two kinds of features into compound spatio-temporal features. When the upper-body posture in the video changes quickly or upper-body frames are missing, the algorithm is affected, so this bimodal method combining facial expression and upper-body posture has limitations in feature extraction.
CN105512609A discloses a multimodal fusion video emotion recognition method based on a kernel extreme learning machine. Feature extraction and feature selection are performed on the image and audio information of the video to obtain video features; the acquired multichannel EEG signals are preprocessed and subjected to feature extraction and feature selection to obtain EEG features; a multimodal fusion video emotion recognition model based on a kernel extreme learning machine is established; and the video features and EEG features are input into the model for video emotion recognition, yielding the final classification accuracy. However, the algorithm achieves a high recognition rate only for three classes of video emotion data, and its usability is therefore limited.
CN103400145B discloses a cue-neural-network-based audio-visual fusion emotion recognition method. The method first distinguishes three channels of user data: frontal facial expression, profile facial expression and speech, and independently trains a neural network for each to recognize discrete emotion categories. During training, 4 cue nodes are added to the output layer of each neural network model, carrying the cue information of 4 coarse-grained categories in the activation-evaluation space; a multimodal fusion model, itself a neural network trained with cue information, then fuses the outputs of the three networks. However, since most videos contain few profile-face frames, effective acquisition is difficult, which greatly limits the method in practice. The method also involves training and fusing neural networks; as data volume and data dimensionality grow, the consumption of training time and resources increases gradually, and the error rate rises as well.
CN105138991B discloses a video emotion recognition method based on the fusion of emotionally salient features. The method extracts audio features and visual emotion features from each video shot in the training set; the audio features are built into an emotion distribution histogram feature with a bag-of-words model, and the visual emotion features are built into an emotion attention feature with a visual dictionary. The emotion attention feature and the emotion distribution histogram feature are fused top-down to form emotionally salient video features. When extracting visual emotion features, this method uses only video key frames and to some extent ignores the associations between features of consecutive frames.
Summary of the invention
The technical problem to be solved by the present invention is to provide a video emotion recognition method fusing facial expression recognition and speech emotion recognition: an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The method separates the facial expression recognition and the speech emotion recognition in the video and recognizes video emotion in two progressive processes; by computing conditional probabilities, speech emotion recognition is carried out on the basis of the facial expression recognition result. The invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and achieves low recognition rates.
The technical solution adopted by the present invention to solve this problem is a video emotion recognition method fusing facial expression recognition and speech emotion recognition, an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The specific steps are as follows:
Process A. Facial-image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions, and the first classification of facial expression recognition, with the following steps:
First step: frame extraction from the video signal and extraction of the speech signal:
The videos in the database are decomposed into image frame sequences, frame extraction being performed with the freely available FormatFactory software, and the speech signal in each video is extracted and saved in MP3 format;
Second step: preprocessing of the image frame sequence and the speech signal:
The image frame sequence obtained in the first step is processed with the published Viola&Jones algorithm to locate and crop the face, and the cropped face images are normalized to M × M pixels, giving a size-normalized face image frame sequence.
The speech signal obtained in the first step is processed with the well-known voice activity detection (VAD) algorithm to detect speech and remove noise and silent segments, giving a speech signal from which features are easier to extract.
This completes the preprocessing of the image frame sequence and the speech signal;
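The patent invokes a standard VAD algorithm without specifying it. As a rough illustration only (not the patent's algorithm), a minimal energy-threshold VAD can drop silent frames; the frame length and threshold ratio below are arbitrary assumptions:

```python
import numpy as np

def simple_vad(signal, frame_len=400, energy_ratio=0.1):
    """Keep only frames whose short-time energy exceeds a fraction
    of the mean frame energy (a crude stand-in for a real VAD)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    threshold = energy_ratio * energies.mean()
    voiced = frames[energies > threshold]
    return voiced.reshape(-1)

# Silence followed by a sine burst: the silent half should be removed.
t = np.arange(4000)
sig = np.concatenate([np.zeros(4000), np.sin(0.1 * t)])
out = simple_vad(sig)
```

A real system would add overlap, hangover smoothing and a noise-floor estimate; this only shows the energy-gating idea.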
Third step: labeling facial feature points on the image frame sequence and screening the key frame:
The size-normalized face image frame sequence from the second step is labeled with T facial feature points, T ranging over 1, 2, ..., 68; the positions of the 68 feature points are well known, and the labeled feature points outline the eye, eyebrow, nose and mouth regions of the face image. From the coordinates of the T feature points, the following 6 specific distances are computed for the u-th frame of the size-normalized face image frame sequence of the second step:
vertical distance between eye and eyebrow, D_{u,1} = d_vertical||p_22, p_40||,
vertical eye-opening distance, D_{u,2} = d_vertical||p_45, p_47||,
vertical distance between eye and mouth, D_{u,3} = d_vertical||p_37, p_49||,
vertical distance between nose and mouth, D_{u,4} = d_vertical||p_34, p_52||,
vertical distance between upper and lower lip, D_{u,5} = d_vertical||p_52, p_58||,
horizontal width of the mouth, D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} - p_{i,y}|, d_horizontal||p_i, p_j|| = |p_{j,x} - p_{i,x}| (1),
In formula (1), p_i is the coordinate set of the i-th feature point and p_j that of the j-th feature point; p_{i,y} and p_{j,y} are their ordinates and p_{i,x} and p_{j,x} their abscissas; d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, and d_horizontal||p_i, p_j|| the horizontal distance, with i = 1, 2, ..., 68 and j = 1, 2, ..., 68;
Taking the first frame of the size-normalized face image frame sequence of the second step as the neutral frame, the set V_0 of its 6 specific distances is given by formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}] (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame of the size-normalized face image frame sequence of the second step;
The set V_u of the 6 specific distances of the u-th frame of the size-normalized face image frame sequence of the second step is given by formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}] (3),
In formula (3), u = 1, 2, ..., K-1, where K is the number of face images in one size-normalized face image frame sequence of the second step, and D_{u,1}, ..., D_{u,6} are the 6 specific distances of the u-th frame;
The sum of the ratios of the 6 corresponding specific distances of the neutral frame and the u-th frame of the size-normalized face image frame sequence of the second step is given by formula (4),
DF_u = Σ_{n=1}^{6} D_{0,n} / D_{u,n} (4),
In formula (4), DF_u is the sum of the ratios of the 6 specific distances of the neutral frame to the corresponding distances of the u-th frame, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame;
In the size-normalized face image frame sequence of the second step, the distance ratio DF of every frame is obtained according to formula (2), formula (3) and formula (4), and the frame with the largest DF is screened out as the key frame of the image frame sequence.
This completes the labeling of facial feature points on the image frame sequence and the screening of the key frame of the image frame sequence;
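The key-frame screening above can be sketched as follows, assuming the reconstruction of formula (4) as the sum of neutral-to-current distance ratios; the per-frame landmark distances are toy values invented for illustration:

```python
import numpy as np

def select_key_frame(distances):
    """distances: (K, 6) array of the 6 specific distances per frame;
    row 0 is the neutral frame.  DF_u = sum_n D_{0,n} / D_{u,n}
    (formula (4) as reconstructed); the key frame maximizes DF."""
    distances = np.asarray(distances, dtype=float)
    neutral = distances[0]
    df = (neutral / distances[1:]).sum(axis=1)   # DF for frames 1..K-1
    return int(np.argmax(df)) + 1                # index of the key frame

# Toy sequence: frame 2 deviates most from the neutral distances,
# so it should be screened out as the key frame.
seq = [[10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10,  9, 10],
       [ 5,  5, 10, 10,  5, 10],
       [10,  9, 10, 10, 10, 10]]
key = select_key_frame(seq)
```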
Fourth step: extraction of facial texture features:
Facial texture features are extracted with the LBP-TOP algorithm. First, the size-normalized face image frame sequence of the second step is divided in space-time into the three orthogonal planes XY, XT and YT. In each orthogonal plane, the LBP value of the central pixel of each 3 × 3 neighborhood is computed, the LBP histogram of each of the three orthogonal planes is accumulated, and finally the three LBP histograms are concatenated into an overall feature vector. The LBP operator is computed by formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q (5),
Sig(t_q - t_c) = 1 if t_q - t_c ≥ 0, and 0 otherwise (6),
In formula (5) and formula (6), Z is the number of neighborhood points around the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP code of the q-th neighborhood point.
The LBP-TOP histogram is defined by formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a}, a = 0, ..., n_b - 1, b = 0, 1, 2 (7),
In formula (7), b is the index of the plane: b = 0 is the XY plane, b = 1 the XT plane and b = 2 the YT plane; n_b is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP code equals a when features are extracted in the b-th plane with the LBP_{Z,R} operator;
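The basic LBP operator of formulas (5) and (6) can be illustrated on a single 3 × 3 neighborhood (Z = 8, R = 1); the clockwise neighbour ordering from the top-left is an assumption, since the patent does not fix it:

```python
import numpy as np

def lbp_code(patch):
    """LBP value of the centre pixel of a 3x3 patch:
    LBP = sum_q Sig(t_q - t_c) * 2^q, with Sig(x) = 1 if x >= 0 else 0
    (formulas (5) and (6)); neighbours taken clockwise from top-left."""
    t_c = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 if t_q >= t_c else 0) << q
               for q, t_q in enumerate(neighbours))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)   # an 8-bit pattern in [0, 255]
```

LBP-TOP simply accumulates these codes into a histogram per orthogonal plane (XY, XT, YT) and concatenates the three histograms.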
Fifth step: extraction of facial geometric features:
From the key frame screened in the third step, the coordinates of the T feature points labeled in the key frame are computed to obtain the geometric features of the facial expression. In facial expression recognition, the richest facial feature region is the T-shaped region of the face, mainly comprising the eyebrows, eyes, nose, chin and mouth; the geometric feature extraction therefore mainly extracts distance features between the labeled points of this T-shaped region;
Step 5.1: computing the Euclidean distance features of facial feature-point pairs:
From the T feature points of the key frame screened in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 pairs in total, and the Euclidean distance between each feature-point pair A and B is computed, giving a 50-dimensional Euclidean distance feature, denoted G50. The Euclidean distance between feature points A and B is computed by formula (8),
d(A, B) = sqrt((p_{A,x} - p_{B,x})^2 + (p_{A,y} - p_{B,y})^2) (8),
In formula (8), p_A and p_B are the coordinate sets of feature points A and B; p_{A,x} and p_{A,y} are the abscissa and ordinate of A, and p_{B,x} and p_{B,y} those of B;
Step 5.2: computing the angle features of facial feature points:
From the T feature points of the key frame screened in the third step, 10 angles characterizing the changes of the facial features are chosen: 2 eyebrow angles, 6 eye angles and 2 mouth angles are computed, giving a 10-dimensional angle feature, denoted Q10. Each feature-point angle is computed by formula (9),
θ = arccos(((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||)) (9),
In formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points forming an angle in the eyebrow, eye or mouth region labeled in the third step, with p_D the coordinate set of the vertex;
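A minimal sketch of the distance and angle features of steps 5.1 and 5.2, with formula (9) read as the usual three-point angle at the vertex p_D; the landmark coordinates are toy values:

```python
import numpy as np

def euclidean_feature(p_a, p_b):
    """Formula (8): Euclidean distance between two landmarks."""
    return float(np.hypot(p_a[0] - p_b[0], p_a[1] - p_b[1]))

def angle_feature(p_c, p_d, p_e):
    """Formula (9) as reconstructed: angle at vertex p_d formed by
    the landmarks p_c and p_e, in degrees."""
    v1 = np.asarray(p_c, float) - np.asarray(p_d, float)
    v2 = np.asarray(p_e, float) - np.asarray(p_d, float)
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

d = euclidean_feature((0, 0), (3, 4))       # a 3-4-5 triangle side
a = angle_feature((1, 0), (0, 0), (0, 1))   # right angle at the vertex
```

Stacking 50 such distances and 10 such angles yields the G50 and Q10 vectors of the text.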
Step 5.3: computing facial region area features:
Five regions of the face image are selected: the left and right eyebrows, the two eyes, and the mouth, and the area of each of the 5 regions is computed. Because face sizes differ from person to person, the areas of the 5 facial regions extracted from the key frame are subtracted from the corresponding areas extracted from the neutral frame, giving the change features of the facial region areas, 5 dimensions in total, denoted O5. The eyebrow, mouth and eye regions of the face are modeled as triangles, and each triangle area is computed with Heron's formula. The Euclidean distance features G50 of the feature-point pairs, the angle features Q10 of the feature points, and the region area features O5 are combined into the facial geometric feature F, as in formula (10),
F = [G50 Q10 O5] (10),
So far, the facial texture features and the facial geometric features are concatenated, completing the extraction of the facial expression features;
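The triangle-area computation with Heron's formula in step 5.3 can be sketched as follows; the three landmark coordinates per region are placeholders:

```python
import math

def heron_area(p1, p2, p3):
    """Area of the triangle spanned by three landmarks (Heron's formula),
    used here for the eyebrow, eye and mouth regions."""
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p3, p1)
    s = (a + b + c) / 2
    # max(...) guards against tiny negative values from rounding.
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

# Region-area change feature: key-frame area minus neutral-frame area.
neutral = heron_area((0, 0), (4, 0), (0, 3))
key = heron_area((0, 0), (4, 0), (0, 6))
change = key - neutral
```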
Sixth step: grouping of facial expressions:
The six facial emotions: surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups, as follows:
First group: surprise, fear; second group: anger, disgust; third group: happiness, sadness;
Seventh step: first classification of facial expression recognition:
The facial expression features extracted in the fourth and fifth steps are fed into an ELM classifier for training and testing, completing the first classification of facial expression recognition and giving its recognition result. The ELM parameters are: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: the "Sigmoid" function;
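The ELM classifier itself is not spelled out in the patent. A minimal textbook ELM with 20 sigmoid hidden neurons (matching the stated parameters) would look roughly like this; the toy two-cluster "expression feature" data is invented for illustration:

```python
import numpy as np

def train_elm(X, y, hidden=20, seed=0):
    """Minimal ELM: random fixed input weights, sigmoid hidden layer,
    output weights solved in closed form by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))   # random input weights
    b = rng.normal(size=hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # Sigmoid activations
    beta = np.linalg.pinv(H) @ y                # output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)            # predicted class index

# Two well-separated toy clusters with one-hot labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
y = np.vstack([np.tile([1, 0], (20, 1)), np.tile([0, 1], (20, 1))])
W, b, beta = train_elm(X, y)
pred = predict_elm(X, W, b, beta)
```

The closed-form output-weight solution is what makes ELM training fast compared with backpropagation.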
Process B. Speech emotion recognition as the second classification:
On the basis of the facial expression recognition result of Process A, and combining the speech features, Process B performs speech emotion feature extraction and the second classification by speech emotion recognition for each of the three groups of the facial expression grouping of the sixth step. The concrete operations are as follows:
8th step, the extraction of speech emotional feature:
For the classification results of the first subseries of above-mentioned 7th step facial expression recognition, according to the grouping of the 6th step, often
One group of emotion extracts different prosodic features to the difference of the sensitivity of different audio prosodic features respectively:
First group: zero-crossing rate ZCR and logarithmic energy LogE is extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR, logarithmic energy LogE are extracted,
Third group: extracting pitch Pitch, zero-crossing rate ZCR, Teager energy operator TEO,
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain,
For the voice signal M preprocessed in the above second step, the pitch Pitch is calculated with the following formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the voice signal, and the windowed term represents the voice signal with a Hamming window applied, whose calculation is shown in the following formula (12),
In formula (12), N is the number of Hamming windows, and m is the m-th Hamming window;
The calculation of the zero-crossing rate ZCR among the above prosodic features is shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, X(m) is the voice signal of the m-th window after framing and windowing, and the sgn{X(m)} function judges whether the speech amplitude is positive or negative; sgn{X(m)} is calculated by formula (14),
In formula (14), X(m) is the voice signal of the m-th window after framing and windowing;
The calculation formula (15) of the above logarithmic energy LogE is as follows,
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the voice signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency, and θ is the initial phase angle of the signal,
For each group of the three groups in the above 6th-step facial expression grouping, the well-known mel-frequency cepstrum coefficients MFCC, together with their first-order difference features and second-order difference features, are extracted from the audio file corresponding to the group's image frames; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of speech emotional features is completed;
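For a single windowed frame, the three prosodic quantities above can be sketched in plain Python as follows; the function names are ours, the discrete TEO uses the standard form x(m)² − x(m−1)·x(m+1), and the exact normalizations of formulas (13), (15) and (16) may differ from the patent's:

```python
import math

def zero_crossing_rate(x):
    """Average zero-crossing rate of one windowed frame (cf. formula (13))."""
    sgn = lambda v: 1 if v >= 0 else -1          # cf. formula (14)
    n = len(x)
    return sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(1, n)) / (2 * n)

def log_energy(x, eps=1e-10):
    """Logarithmic energy of one frame (cf. formula (15)); eps avoids log(0)."""
    return math.log10(eps + sum(v * v for v in x))

def teager_energy(x):
    """Discrete Teager energy operator psi[x(m)] = x(m)^2 - x(m-1)*x(m+1),
    the standard discrete counterpart of formula (16), averaged over the frame."""
    vals = [x[m] ** 2 - x[m - 1] * x[m + 1] for m in range(1, len(x) - 1)]
    return sum(vals) / len(vals)

# toy frame: a pure cosine, i.e. a signal of constant amplitude and frequency
frame = [math.cos(0.3 * m) for m in range(200)]
features = [zero_crossing_rate(frame), log_energy(frame), teager_energy(frame)]
```

For the constant-amplitude cosine, the discrete TEO comes out as the constant sin²(φ), matching the operator's known behavior on X(m) = a·cos(φm + θ).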
9th step, the second classification of speech emotion recognition:
The speech emotional features extracted in the above 8th step are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since the speech emotion recognition of the above process B is a secondary recognition carried out on the basis of the facial emotion recognition of the above process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first facial image recognition, and P(Audio|Visual) is the recognition rate of the second speech emotion recognition;
So far, the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
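Formula (17) reduces the decision-level fusion to a product of the stage-wise rates; a trivial sketch with hypothetical recognition rates (the function name and numbers are ours):

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Decision-level fusion of formula (17):
    P(Audio_Visual) = P(Visual) * P(Audio | Visual)."""
    assert 0.0 <= p_visual <= 1.0 and 0.0 <= p_audio_given_visual <= 1.0
    return p_visual * p_audio_given_visual

# hypothetical rates: 90% for the first (visual) stage and 85% for the
# second (speech) stage conditioned on the visual result
p = fused_recognition_rate(0.90, 0.85)  # -> 0.765
```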
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the third step uses the coordinates of T feature points, where T = 68.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the English name of the voice endpoint detection algorithm is Voice Activity Detection, abbreviated VAD; the English name of the zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the English name of the logarithmic energy is LogEnergy, abbreviated LogE; the English name of the mel-frequency cepstrum coefficient is Mel-frequency cepstral coefficients, abbreviated MFCC; the English name of the Teager energy operator is Teager Energy Operator, abbreviated TEO. The voice activity detection algorithm, zero-crossing rate, logarithmic energy, mel-frequency cepstrum coefficient and Teager energy operator used here are all well known to the art.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the related calculating operation methods are ones that those skilled in the art will appreciate.
The beneficial effects of the present invention are: compared with the prior art, the outstanding substantive features and marked improvements of the present invention are as follows:
(1) The present invention provides a video emotion recognition method fusing facial expression recognition and speech emotion recognition, an audiovisual emotion recognition method based on the decision level. This method separates the facial expression recognition and the speech emotion recognition in the video and adopts a two-process progressive emotion recognition method: through the calculation of conditional probability, the speech emotion recognition is carried out on the basis of the facial expression recognition, fully taking into account the influence of the facial expression recognition result on the speech emotion recognition, so that facial expression recognition and speech emotion recognition are fused more closely and assist each other, achieving a better human emotion recognition effect. It overcomes the defects of the prior art in human emotion recognition that the inner link between facial features and speech features is ignored, and that the recognition speed of video emotion recognition is slow and the recognition rate is not high.
(2) Different emotions have different sensitivities to different prosodic features. Chien et al. performed an "acoustic characteristic analysis" experiment in the 2014 paper "A new approach of audio emotion recognition". The experiment demonstrated that the sensitivities of 6 kinds of emotions to the prosodic features, i.e. Pitch, Zero-Crossing Rate, LogEnergy and Teager Energy Operator, are different. That paper classified the mel-frequency cepstrum coefficients (Mel-Frequency Cepstral Coefficients, MFCC) extracted from the audio with an SVM classifier; when two-class, four-class and six-class classification were carried out, the recognition rate decreased in turn — the fewer the classes, the better the classification effect of the classifier. Therefore, in the present invention we select three-class classification for the first facial expression classification and two-class classification for the second audio classification. The method of the present invention reduces a multi-class problem to three-class and two-class problems, which both reduces the feature dimension and shortens the training time, greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the present invention advantageously extracts not only the facial features in the video but also the audio features in the video; the bimodal combination of facial features and audio features is conducive to a more accurate identification of the emotion of the person in the video.
(4) Compared with CN105512609A, the method proposed in CN105512609A can only recognize three kinds of emotions in a video, while the present invention can identify six kinds of emotions in a video, and the average recognition rate of the present invention is 9.92% higher than the video emotion recognition rate in CN105512609A.
(5) Compared with CN105138991A, the method of the present invention advantageously classifies the facial features and the speech features separately, avoiding the "curse of dimensionality" easily caused by feature-level fusion; the fusion method at the decision level is simple in operation and faster in training and recognition speed.
(6) When extracting audio features, the present invention considers that different emotions have different sensitivities to different audio features, so that each grouping extracts different audio features, which is conducive to the second classification based on speech features.
(7) The present invention extracts texture, geometric, temporal and prosodic features; different features reflect different characteristics of an expression, so the classifier can be better trained and video emotion recognition is carried out from multiple modalities.
(8) The present invention uses a two-stage progressive emotion classification method, with face recognition as the main stage and speech recognition as the auxiliary stage; the two are complementary and assist each other, and more accurate video emotion recognition can be achieved.
Description of the drawings
Present invention will be further explained below with reference to the attached drawings and examples.
Fig. 1 is a schematic process flow diagram of the method of the present invention.
Fig. 2 is a labeling schematic diagram of the 6 specific distances and the 68 feature points of the face.
Fig. 3 is an example diagram of the 68-feature-point labeling of a face in the eNTERFACE'05 database.
Specific embodiment
The embodiment shown in Fig. 1 shows the flow of the method of the present invention: Process A. facial image expression recognition as the first classification recognition → frame extraction and voice signal extraction from the video signal → preprocessing of the image frame sequence and the voice signal → labeling facial feature points on the image frame sequence and screening the key frame in the image frame sequence → extraction of face texture features → extraction of face geometric features → grouping of facial expressions → first classification of facial expression recognition; Process B. speech emotion recognition as the second classification recognition → extraction of speech emotional features → second classification of speech emotion recognition; Process C. fusion of facial expression recognition and speech emotion recognition → fusion of facial expression recognition and speech emotion recognition at the decision level → thus the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
The embodiment shown in Fig. 2 shows the labeling of the 6 specific distances and the 68 facial feature points; it is an example image annotated with feature points. The 6 specific distances are, in order: the vertical distance between feature points 22 and 40, denoted D_{u,1}; the vertical distance between feature points 45 and 47, denoted D_{u,2}; the vertical distance between feature points 37 and 49, denoted D_{u,3}; the vertical distance between feature points 34 and 52, denoted D_{u,4}; the vertical distance between feature points 52 and 58, denoted D_{u,5}; and the horizontal distance between feature points 49 and 55, denoted D_{u,6}. The lines between the feature points in the figure outline the contours of the eyebrow, eye and mouth regions of the face.
The embodiment shown in Fig. 3 shows an example diagram of a face in the eNTERFACE'05 database labeled with facial feature points using Dlib; the 68 feature points marked in the figure correspond to the labels of the 68 feature points in the facial feature point labeling schematic shown in Fig. 2.
Embodiment 1
The video emotion recognition method fusing facial expression recognition and speech emotion recognition of this embodiment is a two-process progressive audiovisual emotion recognition method based on the decision level, with the specific steps as follows:
Process A. Facial image expression recognition as the first classification recognition:
Process A includes the extraction of facial expression features, the grouping of facial expressions and the first classification of facial expression recognition; the steps are as follows:
The first step, frame extraction and voice signal extraction from the video signal:
The video in the database is decomposed into an image frame sequence, frame extraction is performed using the open-source FormatFactory software, and the voice signal in the video is extracted and saved in MP3 format;
The second step, preprocessing of the image frame sequence and the voice signal:
The image frame sequence obtained in the above first step is subjected to face localization and cropping using the publicly available Viola&Jones algorithm, and the cropped facial images are normalized to M × M pixels, obtaining the facial-image-size-normalized image frame sequence;
The voice signal obtained in the above first step is subjected to speech detection using the well-known voice activity detection algorithm VAD, removing noise and silent segments and obtaining a voice signal from which features are easier to extract;
Thus the preprocessing of the image frame sequence and the voice signal is completed;
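The energy-based idea behind the VAD preprocessing step can be sketched as follows; this is a minimal stand-in rather than the specific well-known VAD algorithm the method relies on, and the frame length and threshold are arbitrary illustrative values:

```python
import math

def simple_energy_vad(signal, frame_len=160, threshold=0.01):
    """Minimal energy-threshold voice activity detection: keep only frames
    whose mean squared amplitude exceeds the threshold.  Real VAD algorithms
    (and the one assumed by the method) are considerably more elaborate."""
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(v * v for v in frame) / frame_len
        if energy > threshold:          # voiced frame: keep it
            kept.extend(frame)          # silent/noise frame: drop it
    return kept

# toy signal: one near-silent frame followed by one speech-like tone frame
sig = [0.001] * 160 + [0.5 * math.sin(0.2 * i) for i in range(160)]
voiced = simple_energy_vad(sig)
```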
Third step, labeling facial feature points on the image frame sequence and screening the key frame in the image frame sequence:
The facial-image-size-normalized image frame sequence of the above second step is subjected to labeling of T facial feature points, where the value range of T is 1, 2, ..., 68; the positions of the 68 feature points are well known, and the labeled feature point contours on the facial image are respectively the eye, eyebrow, nose and mouth regions. According to the coordinates of the T = 68 feature points in this embodiment, the following 6 specific distances are calculated for the u-th frame image in the facial-image-size-normalized image frame sequence of the above second step:
The vertical distance between the eyes and the eyebrows, D_{u,1}: D_{u,1} = d_vertical||p22, p40||,
The vertical distance of the eye opening, D_{u,2}: D_{u,2} = d_vertical||p45, p47||,
The vertical distance between the eyes and the mouth, D_{u,3}: D_{u,3} = d_vertical||p37, p49||,
The vertical distance between the nose and the mouth, D_{u,4}: D_{u,4} = d_vertical||p34, p52||,
The vertical distance between the upper and lower lips, D_{u,5}: D_{u,5} = d_vertical||p52, p58||,
The horizontal width distance between the two sides of the mouth, D_{u,6}: D_{u,6} = d_horizontal||p49, p55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} − p_{i,y}|, d_horizontal||p_i, p_j|| = |p_{j,x} − p_{i,x}| (1),
In formula (1), p_i is the coordinate set of the i-th feature point, p_j is the coordinate set of the j-th feature point, p_{i,y} is the ordinate of the i-th feature point, p_{j,y} is the ordinate of the j-th feature point, p_{i,x} is the abscissa of the i-th feature point, p_{j,x} is the abscissa of the j-th feature point, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, and i = 1, 2, ..., 68, j = 1, 2, ..., 68;
Taking the first frame in the facial-image-size-normalized image frame sequence of the above second step as the neutral frame, the set V_0 of its 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}] (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are respectively the 6 specific distances corresponding to the neutral frame in the facial-image-size-normalized image frame sequence of the above second step;
The set V_u of the 6 specific distances of the u-th frame in the facial-image-size-normalized image frame sequence of the above second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}] (3),
In formula (3), u = 1, 2, ..., K−1, where K is the number of facial images in one group of the facial-image-size-normalized image frame sequence of the above second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are respectively the 6 specific distances corresponding to the u-th frame in that image frame sequence;
The sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame in the facial-image-size-normalized image frame sequence of the above second step is shown in formula (4),
In formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image in the facial-image-size-normalized image frame sequence of the above second step, n is the index of the 6 specific distances, D_{0,n} represents the n-th specific distance corresponding to the neutral frame, and D_{u,n} represents the n-th specific distance corresponding to the u-th frame;
In the facial-image-size-normalized image frame sequence of the above second step, the distance-ratio DF corresponding to each frame image in the image frame sequence is obtained according to formulas (2), (3) and (4), and the image frame with the maximum DF obtained by screening is the key frame of the image frame sequence,
Thus the labeling of facial feature points on the image frame sequence and the screening of the key frame in the image frame sequence are completed;
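A minimal sketch of the key-frame screening just described, assuming formula (4) sums the per-distance ratios of the u-th frame to the neutral frame (the exact ratio direction is not explicit in the surviving text); the function name and toy numbers are ours:

```python
def key_frame_index(frames_distances):
    """Pick the key frame by the distance-ratio score of formulas (1)-(4).
    frames_distances: list of 6-element distance vectors [D_{u,1}..D_{u,6}],
    with frame 0 the neutral frame.  We read formula (4) as the sum of the
    ratios D_{u,n} / D_{0,n}, so the frame deviating most from neutral wins."""
    neutral = frames_distances[0]
    best_u, best_df = 1, float("-inf")
    for u in range(1, len(frames_distances)):
        df = sum(frames_distances[u][n] / neutral[n] for n in range(6))
        if df > best_df:
            best_u, best_df = u, df
    return best_u

# hypothetical distance vectors for a neutral frame and three expression frames
seq = [
    [10, 8, 30, 12, 6, 40],   # neutral frame (u = 0)
    [10, 8, 31, 12, 6, 40],
    [14, 12, 36, 13, 9, 46],  # most exaggerated expression -> key frame
    [11, 9, 32, 12, 7, 41],
]
k = key_frame_index(seq)  # -> 2
```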
4th step, extraction of face texture features:
The face texture features are extracted with the LBP-TOP algorithm. First, the facial-image-size-normalized image frame sequence of the above second step is divided in space-time into the three orthogonal planes XY, XT and YT; the LBP value of the central pixel of each 3 × 3 neighborhood is calculated in each orthogonal plane, the LBP histogram features of the three orthogonal planes are counted, and finally the LBP histograms of the three orthogonal planes are concatenated to form the overall feature vector, where the LBP operator is calculated as shown in formulas (5) and (6),
In formulas (5) and (6), Z is the number of neighborhood points of the central pixel, R is the distance between the neighborhood points and the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q − t_c) is the LBP coded bit of the q-th neighborhood point,
The LBP-TOP histogram is defined as shown in formula (7),
In formula (7), b is the index of the plane, with b = 0 the XY plane, b = 1 the XT plane and b = 2 the YT plane; n_b is the number of binary patterns generated by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP code value is a when feature extraction is performed with the LBP_{Z,R} operator in the b-th plane;
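The basic LBP operator of formulas (5) and (6) can be sketched for a single 3 × 3 neighborhood (Z = 8, R = 1) as follows; the neighbor ordering convention varies between implementations, and LBP-TOP simply applies such a code on the XY, XT and YT planes and histograms the results:

```python
def lbp_code(neighborhood):
    """LBP code of the center pixel of a 3x3 neighborhood (Z = 8, R = 1),
    cf. formulas (5) and (6): threshold the 8 neighbors t_q against the
    center t_c and weight the resulting bits by 2^q."""
    tc = neighborhood[1][1]
    # clockwise neighbor order starting at the top-left corner (a convention;
    # other implementations may start elsewhere or go counter-clockwise)
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for q, (r, c) in enumerate(coords):
        if neighborhood[r][c] >= tc:    # Sig(t_q - t_c)
            code += 1 << q
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)  # -> 241
```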
5th step, extraction of face geometric features:
According to the key frame of the screened image frame sequence obtained in the above third step, the coordinates of the T feature points marked in the key frame are calculated to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the facial T-shaped region, which mainly includes the eyebrow, eye, nose, chin and mouth regions, the specific feature points being shown in Table 1; therefore the extraction method of the face geometric features mainly extracts the distance features between the marked points in the facial T-shaped region;
5.1st step, calculating the Euclidean distance features of facial feature point pairs:
From the T feature points in the key frame of the screened image frame sequence obtained in the above third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 feature point pairs in total, and the Euclidean distance between the points A and B of each feature point pair is calculated, giving a 50-dimensional Euclidean distance feature, denoted G50.
The formula (8) for calculating the Euclidean distance between the points A and B of a feature point pair is as follows,
In formula (8), p_A is the coordinate set of feature point A, p_B is the coordinate set of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
Table 1 shows the facial feature point pairs in the facial T-shaped region to be calculated, where d||p_A, p_B|| denotes the Euclidean distance between the feature point pair A, B;
Table 1
5.2nd step, calculating the angle features of facial feature points:
From the T = 68 feature points of the key frame obtained by the screening of the above third step, 10 angles characterizing facial feature changes are selected, among them 2 eyebrow angles, 6 eye angles and 2 mouth angles; the angle features are extracted, giving a 10-dimensional angle feature in total, denoted Q10. The specific angles are shown in Table 2, and the formula (9) for calculating a feature point angle is as follows,
In formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points corresponding to an angle formed in the eyebrow, eye and mouth regions of the facial feature points marked in the above third step, where p_D is the coordinate set of the vertex point;
Table 2 shows the angles of the facial feature points in the facial T-shaped region to be calculated, where Q(p_C, p_D, p_E) denotes the angle feature of angle D;
Table 2
5.3rd step, calculating the facial region area features:
5 regions of the facial image are selected, including the left and right eyebrows, the two eyes and the mouth, and the area features of these 5 regions are calculated separately; the specific areas are shown in Table 3;
Table 3
Table 3 shows the areas of the regions defined by the facial feature points in the facial T-shaped region to be calculated, where O(p_A, p_B, p_C, p_D) denotes the area of the region enclosed by the lines of feature points A, B, C, D;
Owing to the differences in face size between individuals, the areas of the 5 facial regions in Table 3 extracted from the key frame are here subtracted from the corresponding areas of the 5 facial regions extracted from the neutral frame, obtaining the change features of the facial image region areas, 5 dimensions in total, denoted O5; the facial eyebrow region, facial mouth region and facial eye region are set as triangles, and Heron's formula is used to calculate the area of each triangle. The Euclidean distance features G50 of the facial feature point pairs, the angle features Q10 of the facial feature points and the facial region area features O5 are combined into the geometric feature F of the face, as shown in formula (10),
F=[G50 Q10 O5] (10),
So far, the face texture features and the face geometric features are concatenated, and the extraction of facial expression features is completed;
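The three geometric ingredients used above — pairwise Euclidean distances (formula (8)), vertex angles (formula (9)) and Heron-formula triangle areas — can be sketched as follows; the feature-point coordinates are hypothetical and the function names are ours:

```python
import math

def euclid(pA, pB):
    """Euclidean distance between feature points A and B (cf. formula (8))."""
    return math.hypot(pB[0] - pA[0], pB[1] - pA[1])

def angle_at(pC, pD, pE):
    """Angle at vertex D formed by points C, D, E (cf. formula (9)), radians."""
    v1 = (pC[0] - pD[0], pC[1] - pD[1])
    v2 = (pE[0] - pD[0], pE[1] - pD[1])
    cos_q = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, cos_q)))

def heron_area(pA, pB, pC):
    """Triangle area from the three side lengths via Heron's formula."""
    a, b, c = euclid(pB, pC), euclid(pA, pC), euclid(pA, pB)
    s = (a + b + c) / 2
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

d = euclid((0, 0), (3, 4))               # -> 5.0
q = angle_at((1, 0), (0, 0), (0, 1))     # -> pi/2
o = heron_area((0, 0), (4, 0), (0, 3))   # -> 6.0
F = [d, q, o]  # concatenated in the spirit of F = [G50 Q10 O5], formula (10)
```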
6th step, the grouping of facial expressions:
The six facial emotions: surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups, with the specific grouping as follows:
First group: surprise, fear; Second group: anger, disgust; Third group: happiness, sadness;
7th step, the first classification of facial expression recognition:
The facial expression features extracted in the above 4th and 5th steps are fed into an ELM classifier for training and testing, thereby completing the first classification of facial expression recognition and obtaining its recognition result, where the ELM parameters are set as: ELM type: "classification", number of hidden-layer neurons: "20", activation function: "Sigmoid";
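A minimal sketch of the ELM classification step, assuming the usual extreme-learning-machine recipe (random sigmoid hidden layer of 20 neurons, as stated, with output weights solved by least squares); the class name and toy data are ours:

```python
import numpy as np

class TinyELM:
    """Minimal ELM sketch: random sigmoid hidden layer, output weights by
    least squares (pseudo-inverse).  The 20 hidden neurons and sigmoid
    activation mirror the settings stated above."""
    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid layer

    def fit(self, X, Y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # solve hidden @ beta ~= Y in the least-squares sense
        self.beta, *_ = np.linalg.lstsq(self._hidden(X), Y, rcond=None)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# toy two-class data standing in for one expression group's feature vectors
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
pred = TinyELM().fit(X, np.eye(2)[labels]).predict(X)
acc = (pred == labels).mean()
```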
Process B. Speech emotion recognition as the second classification recognition:
On the basis of the facial expression recognition result of process A, process B combines speech features to carry out speech emotional feature extraction and the second classification of speech emotion recognition for each of the three groups in the above 6th-step facial expression grouping; the concrete operations are as follows:
8th step, extraction of speech emotional features:
For the classification results of the first classification of facial expression recognition in the above 7th step, and according to the grouping of the 6th step, different prosodic features are extracted for each group of emotions according to their different sensitivities to different audio prosodic features:
First group: zero-crossing rate ZCR and logarithmic energy LogE is extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR, logarithmic energy LogE are extracted,
Third group: extracting pitch Pitch, zero-crossing rate ZCR, Teager energy operator TEO,
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain,
For the voice signal M preprocessed in the above second step, the pitch Pitch is calculated with the following formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the voice signal, and the windowed term represents the voice signal with a Hamming window applied, whose calculation is shown in the following formula (12),
In formula (12), N is the number of Hamming windows, and m is the m-th Hamming window;
The calculation of the zero-crossing rate ZCR among the above prosodic features is shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, X(m) is the voice signal of the m-th window after framing and windowing, and the sgn{X(m)} function judges whether the speech amplitude is positive or negative; sgn{X(m)} is calculated by formula (14),
In formula (14), X(m) is the voice signal of the m-th window after framing and windowing;
The calculation formula (15) of the above logarithmic energy LogE is as follows,
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the voice signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency, and θ is the initial phase angle of the signal,
For each group of the three groups in the above 6th-step facial expression grouping, the well-known mel-frequency cepstrum coefficients MFCC, together with their first-order difference features and second-order difference features, are extracted from the audio file corresponding to the group's image frames; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of speech emotional features is completed;
9th step, the second classification of speech emotion recognition:
The speech emotional features extracted in the above 8th step are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
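A sketch of the second-stage SVM configuration under the parameter values just stated; mapping "penalty coefficient" to `C` and "kernel parameter" to `gamma` in scikit-learn is our assumption, and the two-class data is toy:

```python
import numpy as np
from sklearn.svm import SVC

# Gaussian-kernel SVM; C=95 and gamma=1 correspond to the stated penalty
# coefficient and kernel parameter under our scikit-learn name mapping
clf = SVC(C=95, kernel="rbf", gamma=1)

rng = np.random.default_rng(0)
# toy audio-feature vectors for the two emotions within one expression group
X = np.vstack([rng.normal(0.0, 0.2, (15, 3)), rng.normal(2.0, 0.2, (15, 3))])
y = np.array([0] * 15 + [1] * 15)
clf.fit(X, y)
acc = clf.score(X, y)   # training accuracy on the toy data
```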
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since the speech emotion recognition is a secondary recognition carried out on the basis of the facial emotion recognition, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first facial image recognition, and P(Audio|Visual) is the recognition rate of the second speech emotion recognition;
So far, the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
This embodiment is compared experimentally with the existing related technology on the eNTERFACE'05 and RML databases; the specific recognition rates are given in Table 4 below:
Table 4
The experimental results of Table 4 list the recognition rate comparison of audiovisual emotion recognition systems on the eNTERFACE'05 and RML databases in recent years: Mahdi Bejani et al. in 2014, in the document "Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks", achieved an average recognition rate of 77.78% for audiovisual emotion recognition on the eNTERFACE'05 database;
Shiqing Zhang et al. in 2016, in the document "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition", achieved an average recognition rate of 74.32% for audiovisual emotion recognition on the RML database;
Shiqing Zhang et al. in 2017, in the document "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition", achieved average recognition rates of 85.97% and 80.36% respectively for audiovisual emotion recognition on the eNTERFACE'05 and RML databases;
Yaxiong Ma et al. in 2018, in the document "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach", achieved average recognition rates of 84.56% and 81.98% respectively for audiovisual emotion recognition on the eNTERFACE'05 and RML databases.
The two-process progressive audiovisual emotion recognition method based on the decision level used in this embodiment shows a distinct improvement in recognition rate compared with these papers of recent years.
In this embodiment, the English name of the voice activity detection algorithm is Voice Activity Detection, abbreviated VAD; the English name of the logarithmic energy is LogEnergy, abbreviated LogE; the English name of the zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the English name of the Teager energy operator is Teager Energy Operator, abbreviated TEO; the English name of the mel-frequency cepstrum coefficient is Mel-frequency cepstral coefficients, abbreviated MFCC. The voice endpoint detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and mel-frequency cepstrum coefficient used here are all well known to the art.
In this embodiment, the related calculating operation methods are ones that those skilled in the art will appreciate.
Claims (2)
1. A video emotion recognition method fusing facial expression recognition and speech emotion recognition, characterized in that it is an audiovisual emotion recognition method based on two progressive processes at the decision level, with the following specific steps:
Process A. Facial image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions, and the first classification of facial expression recognition, with the following steps:
Step 1: frame extraction from the video signal and extraction of the speech signal:
Each video in the database is decomposed into an image frame sequence, with the frame extraction performed by the open-source FormatFactory software, and the speech signal in the video is extracted and saved in MP3 format;
Step 2: preprocessing of the image frame sequence and the speech signal:
The face in the image frame sequence obtained in Step 1 is located and cropped with the published Viola & Jones algorithm, and the cropped face image is normalized to M × M pixels, yielding a size-normalized face image frame sequence;
The speech signal obtained in Step 1 is processed with the well-known voice activity detection algorithm VAD to detect speech and remove noise and silent segments, yielding a speech signal from which features are easier to extract;
This completes the preprocessing of the image frame sequence and the speech signal;
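The claim only requires a generic, well-known VAD; as a rough illustration, a minimal energy-threshold detector over fixed-length frames might look like the following sketch (the frame length, threshold ratio and toy signal are illustrative assumptions, not part of the claim):

```python
import numpy as np

def simple_vad(signal, frame_len=256, threshold_ratio=0.1):
    """Keep only frames whose short-time energy exceeds a fraction of the peak frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    keep = energies > threshold_ratio * energies.max()
    return frames[keep].ravel()

# toy signal: silence, then a voiced burst, then silence
sig = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 50, 512)), np.zeros(512)])
voiced = simple_vad(sig)
```

Real VAD algorithms combine several cues (energy, zero-crossing rate, spectral features); the energy threshold here only illustrates the effect of removing silent segments.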
Step 3: marking facial feature points in the image frame sequence and screening the key frame of the sequence:
T facial feature points are marked on the size-normalized face image frame sequence from Step 2, where T takes values 1, 2, …, 68; the positions of the 68 feature points are well known, and the marked feature points outline the eye, eyebrow, nose and mouth regions on the face image. From the coordinates of the T feature points, the following 6 specific distances are calculated for the u-th frame of the size-normalized image frame sequence from Step 2:
vertical distance between eyes and eyebrows Du,1: Du,1 = dvertical||p22, p40||,
vertical distance of eye opening Du,2: Du,2 = dvertical||p45, p47||,
vertical distance between eyes and mouth Du,3: Du,3 = dvertical||p37, p49||,
vertical distance between nose and mouth Du,4: Du,4 = dvertical||p34, p52||,
vertical distance between upper and lower lips Du,5: Du,5 = dvertical||p52, p58||,
horizontal width of the mouth Du,6: Du,6 = dhorizontal||p49, p55||,
where
dvertical||pi, pj|| = |pj,y − pi,y|, dhorizontal||pi, pj|| = |pj,x − pi,x| (1),
In formula (1), pi is the coordinate of the i-th feature point, pj is the coordinate of the j-th feature point, pi,y and pj,y are their ordinates, pi,x and pj,x are their abscissas, dvertical||pi, pj|| is the vertical distance between feature points i and j, dhorizontal||pi, pj|| is the horizontal distance between feature points i and j, i = 1, 2, …, 68, j = 1, 2, …, 68;
Taking the first frame of the size-normalized face image frame sequence from Step 2 as the neutral frame, the set V0 of its 6 specific distances is as shown in formula (2),
V0 = [D0,1, D0,2, D0,3, D0,4, D0,5, D0,6] (2),
In formula (2), D0,1, D0,2, D0,3, D0,4, D0,5 and D0,6 are the 6 specific distances of the neutral frame in the size-normalized image frame sequence from Step 2;
The set Vu of the 6 specific distances of the u-th frame in the size-normalized image frame sequence from Step 2 is as shown in formula (3),
Vu = [Du,1, Du,2, Du,3, Du,4, Du,5, Du,6] (3),
In formula (3), u = 1, 2, …, K−1, where K is the number of face images in the size-normalized image frame sequence from Step 2, and Du,1, Du,2, Du,3, Du,4, Du,5, Du,6 are the 6 specific distances of the u-th frame;
The sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame is as shown in formula (4),
DFu = Σ (n = 1 to 6) Du,n / D0,n (4),
In formula (4), DFu represents the sum of the ratios of the 6 corresponding specific distances between the neutral frame and the u-th frame, n indexes the 6 specific distances, D0,n is the n-th specific distance of the neutral frame, and Du,n is the n-th specific distance of the u-th frame;
For the size-normalized image frame sequence from Step 2, the specific-distance ratio DFu of every frame is computed according to formulas (2), (3) and (4), and the frame with the maximum DF is screened out as the key frame of the image frame sequence;
This completes the marking of facial feature points in the image frame sequence and the screening of the key frame;
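The key-frame screening above can be sketched as follows, assuming the 68 landmarks arrive as a (68, 2) array of (x, y) coordinates per frame and reading formula (4) as the sum of the per-distance ratios Du,n/D0,n (the printed formula is not reproduced in this text, so that reading is an assumption):

```python
import numpy as np

# 1-based landmark index pairs of the six specific distances (from the claim text)
PAIRS = [(22, 40), (45, 47), (37, 49), (34, 52), (52, 58), (49, 55)]

def six_distances(pts):
    """pts: (68, 2) array of landmark coordinates. Returns the 6 specific distances:
    the first five are vertical (|dy|), the sixth horizontal (|dx|), per formula (1)."""
    d = []
    for k, (i, j) in enumerate(PAIRS):
        axis = 0 if k == 5 else 1            # last pair: horizontal mouth width
        d.append(abs(pts[j - 1, axis] - pts[i - 1, axis]))
    return np.array(d, dtype=float)

def key_frame_index(frames_pts):
    """frames_pts: list of (68, 2) landmark arrays; frame 0 is the neutral frame.
    Returns the index (>= 1) of the frame maximising DF_u = sum_n D_{u,n} / D_{0,n}."""
    d0 = six_distances(frames_pts[0])
    dfs = [(six_distances(p) / d0).sum() for p in frames_pts[1:]]
    return 1 + int(np.argmax(dfs))
```

The frame with the largest DF corresponds to the strongest geometric deviation from the neutral face, i.e. the expression apex.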
Step 4: extraction of face texture features:
Face texture features are extracted with the LBP-TOP algorithm. First, the size-normalized face image frame sequence from Step 2 is divided in space-time into the three orthogonal planes XY, XT and YT; in each orthogonal plane, the LBP value of the central pixel of each 3 × 3 neighborhood is calculated, and the LBP histogram feature of each of the three orthogonal planes is counted; finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formulas (5) and (6),
LBPZ,R = Σ (q = 0 to Z−1) Sig(tq − tc) · 2^q (5),
Sig(tq − tc) = 1 if tq − tc ≥ 0, and 0 otherwise (6),
In formulas (5) and (6), Z is the number of points in the neighborhood of the central pixel, R is the distance between the neighborhood points and the central pixel, tc is the pixel value of the central pixel, tq is the pixel value of the q-th neighborhood point, and Sig(tq − tc) is the LBP code bit of the q-th neighborhood point;
The LBP-TOP histogram is defined as shown in formula (7),
Ha,b = Σ (x, y, t) I{ LBPZ,R,b(x, y, t) = a }, a = 0, …, nb − 1 (7),
In formula (7), b is the plane index, with b = 0 the XY plane, b = 1 the XT plane and b = 2 the YT plane; nb is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBPZ,R,b(x, y, t) = a} counts the pixels of the b-th plane whose LBP code value equals a when features are extracted with the LBPZ,R operator;
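A minimal sketch of the basic LBP code of formulas (5)–(6) on a single plane, using the 8 neighbours of a 3 × 3 neighbourhood (a simplification of the general (Z, R) circular sampling):

```python
import numpy as np

# 8 neighbour offsets of a 3x3 neighbourhood, in a fixed circular order
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """LBP code of the centre pixel (y, x): neighbour q contributes bit 2^q
    when its value is >= the centre value (the Sig function of formula (6))."""
    tc = img[y, x]
    code = 0
    for q, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy, x + dx] >= tc:
            code |= 1 << q
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels of one plane (formula (7))."""
    h = np.zeros(256, dtype=int)
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            h[lbp_code(img, y, x)] += 1
    return h
```

In LBP-TOP this histogram is computed separately for the XY, XT and YT planes of the frame volume and the three histograms are concatenated into the texture feature vector.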
Step 5: extraction of face geometric features:
From the key frame screened in Step 3, the coordinates of the T feature points marked in the key frame are calculated to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the T-shaped region of the face, comprising mainly the eyebrows, eyes, nose, chin and mouth; the extraction of face geometric features therefore mainly extracts distance features between the marked points of the T-shaped region;
Step 5.1: calculating the Euclidean distance features of facial feature point pairs:
From the T feature points of the key frame screened in Step 3, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 pairs in total, and the Euclidean distance between each feature point pair A and B is calculated, giving a 50-dimensional Euclidean distance feature denoted G50. The Euclidean distance between feature points A and B is calculated with formula (8),
d(A, B) = √((pA,x − pB,x)² + (pA,y − pB,y)²) (8),
In formula (8), pA is the coordinate of feature point A, pB is the coordinate of feature point B, pA,x is the abscissa of A, pA,y is the ordinate of A, pB,x is the abscissa of B, and pB,y is the ordinate of B;
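Formula (8) is the ordinary planar Euclidean distance; as a sketch:

```python
import math

def euclidean(pA, pB):
    """Euclidean distance between two landmarks given as (x, y) tuples (formula (8))."""
    return math.hypot(pA[0] - pB[0], pA[1] - pB[1])
```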
5.2nd step calculates the angle character of human face characteristic point:
10 angles that selection characterization face characteristic changes in T characteristic point of the key frame obtained from the screening of above-mentioned third step,
Wherein 22 angles of eyebrow, 6 angles of eyes and mouth angles are calculated, and extract angle character, totally 10 dimension angle character,
It is denoted as Q10, the formula (9) for calculating characteristic point angle is as follows,
In formula (9), pC、pD、pEIt is that eyebrow, eyes and mouth region that above-mentioned third step marks human face characteristic point are formed
Three characteristic point coordinate sets corresponding to angle, wherein pDFor corner point coordinate set;
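The printed formula (9) is not reproduced in this text; it is reconstructed here, as an assumption, as the angle at the corner point pD between the rays toward pC and pE via the dot-product rule:

```python
import math

def corner_angle(pC, pD, pE):
    """Angle in radians at pD formed by the triple pC-pD-pE."""
    v1 = (pC[0] - pD[0], pC[1] - pD[1])
    v2 = (pE[0] - pD[0], pE[1] - pD[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    # clamp guards against rounding slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
```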
5.3rd step calculates human face region area features:
5 regions of facial image, including left and right eyebrow, two eyes and mouth are selected, this 5 regions are calculated separately
Area features, due to the otherness of everyone human face size, here by the face of extracted 5 human face regions of key frame
Product is corresponding with the area of extracted 5 human face regions of neutral frame to subtract each other, and obtains the variation characteristic of facial image region area, altogether
5 dimensions are denoted as O5, face brow region, face mouth region and face eye areas are set as triangle, utilize Helen's public affairs
Formula calculates each triangle area, by the Euclidean distance feature G of human face characteristic point pair50, the angle character Q of human face characteristic point10With
Human face region area features O5It combines shown in the geometrical characteristic F such as formula (10) as face,
F=[G50 Q10 O5] (10),
So far, series connection face textural characteristics and Face geometric eigenvector complete the extraction of human face expression feature;
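A sketch of the triangle-area computation with Heron's formula and of the resulting 5-dimensional area-change feature O5 (the specific triangle vertices below are hypothetical; in practice they come from the marked landmarks):

```python
import math

def heron_area(p1, p2, p3):
    """Area of a triangle from its three (x, y) vertices via Heron's formula."""
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p3, p1)
    s = (a + b + c) / 2                       # semi-perimeter
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

def area_change_features(key_triangles, neutral_triangles):
    """O5: per-region area of the key frame minus that of the neutral frame
    (5 regions: left/right eyebrow, two eyes, mouth)."""
    return [heron_area(*k) - heron_area(*n)
            for k, n in zip(key_triangles, neutral_triangles)]
```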
Step 6: grouping of facial expressions:
The six facial emotions, surprise, fear, anger, disgust, happiness and sadness, are divided into three groups of two, as follows:
Group 1: surprise, fear; Group 2: anger, disgust; Group 3: happiness, sadness;
Step 7: first classification of facial expression recognition:
The facial expression features extracted in Steps 4 and 5 are fed into an ELM classifier for training and testing, completing the first classification of facial expression recognition and obtaining its recognition result. The ELM parameters are: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: "Sigmoid";
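An extreme learning machine of the kind named in Step 7 can be sketched with numpy: a random, fixed hidden layer and a least-squares output layer. The 20-neuron width and sigmoid activation follow the parameters in the text; everything else (random seed, toy data) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine: random fixed hidden weights,
    output weights solved in closed form by least squares."""
    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # sigmoid activation on a random (never-trained) projection
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        T = np.eye(n_classes)[y]                            # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ T     # least-squares output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```

Because only the output weights are solved (no gradient descent), training is a single pseudo-inverse, which is the usual motivation for choosing an ELM here.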
Process B. Speech emotion recognition as the second classification:
On the basis of the facial expression recognition result of Process A, and in combination with speech features, Process B performs speech emotional feature extraction and the second classification of speech emotion recognition for each of the three groups in the facial expression grouping of Step 6, as follows:
Step 8: extraction of speech emotional features:
For the classification result of the first classification of facial expression recognition in Step 7, and according to the grouping of Step 6, different prosodic features are extracted for each group, reflecting the different sensitivity of each group's emotions to different audio prosodic features:
Group 1: zero-crossing rate ZCR and logarithmic energy LogE are extracted;
Group 2: Teager energy operator TEO, zero-crossing rate ZCR and logarithmic energy LogE are extracted;
Group 3: pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO are extracted;
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain: for the speech signal M preprocessed in Step 2, the pitch Pitch is calculated with formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, LM represents the length of the speech signal, and the Hamming-windowed speech signal is calculated as shown in formula (12);
In formula (12), N is the number of Hamming windows and m is the m-th Hamming window;
The zero-crossing rate ZCR among the above prosodic features is calculated as shown in formula (13),
ZCR = (1 / 2N) Σ (m = 1 to N) |sgn{X(m)} − sgn{X(m − 1)}| (13),
In formula (13), ZCR indicates the average zero-crossing rate over the N windows, | | is the absolute value sign, X(m) is the speech signal of the m-th window after framing and windowing, and sgn{X(m)} judges the sign of the speech amplitude; sgn{X(m)} is calculated by formula (14),
sgn{X(m)} = 1 if X(m) ≥ 0, and −1 otherwise (14),
In formula (14), X(m) is the speech signal of the m-th window after framing and windowing;
The logarithmic energy LogE is calculated with formula (15),
LogE = log( Σ (m = 1 to N) X(m)² ) (15),
In formula (15), LogE indicates the total logarithmic energy of the N windows, X(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
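A sketch of short-time zero-crossing rate and log-energy features, computed here per analysis window over a framed signal; since the printed formulas (13)–(15) are not reproduced in this text, this follows the standard textbook definitions and is an assumption rather than the authors' exact formulation:

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

def zcr(frames):
    """Zero-crossing rate per frame: fraction of adjacent-sample sign changes."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat 0 as positive, as in the sgn of formula (14)
    return np.abs(np.diff(signs, axis=1)).sum(axis=1) / (2 * (frames.shape[1] - 1))

def log_energy(frames, eps=1e-12):
    """Log of the summed squared amplitude per frame (eps avoids log(0))."""
    return np.log((frames ** 2).sum(axis=1) + eps)
```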
The Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m) · X″(m) (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency and θ is the initial phase angle of the signal;
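In discrete time the Teager energy operator of formula (16) is commonly approximated as ψ[x(m)] = x(m)² − x(m−1)·x(m+1); a sketch:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x(m)] = x(m)^2 - x(m-1)*x(m+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For the constant-amplitude cosine in the text, x(m) = a·cos(φm + θ), this yields exactly a²·sin²(φ), which approximates a²φ² for small φ, i.e. an energy proportional to both amplitude and frequency squared.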
For the audio file corresponding to the image frames of each of the three groups in the facial expression grouping of Step 6, the well-known Mel-frequency cepstral coefficients MFCC and their first-order and second-order difference features are extracted; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form mixed audio features;
This completes the extraction of speech emotional features;
Step 9: second classification of speech emotion recognition:
The speech emotional features extracted in Step 8 are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition. The SVM parameters are: penalty coefficient: "95"; allowed redundancy output: "0"; kernel parameter: "1"; kernel function of the support vector machine: "Gaussian kernel";
This completes the second classification of speech emotion recognition;
Process C. Fusion of facial expression recognition and speech emotion recognition:
Step 10: fusion of facial expression recognition and speech emotion recognition at the decision level:
Because the speech emotion recognition of Process B is a second recognition carried out on the basis of the facial emotion recognition of Process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual) = P(Visual) × P(Audio | Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first classification on facial images, and P(Audio | Visual) is the recognition rate of the second classification on speech emotion;
This completes the video emotion recognition fusing facial expression recognition and speech emotion recognition based on two progressive processes at the decision level.
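Formula (17) is simply the chain rule for the two-stage pipeline: both stages must be correct, so the final rate is the first-stage rate times the conditional second-stage rate. With the hypothetical rates P(Visual) = 0.95 and P(Audio | Visual) = 0.90 (illustrative numbers, not taken from the patent):

```python
def fused_rate(p_visual, p_audio_given_visual):
    """Decision-level fusion of formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)."""
    return p_visual * p_audio_given_visual

print(fused_rate(0.95, 0.90))   # hypothetical stage rates
```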
2. The video emotion recognition method fusing facial expression recognition and speech emotion recognition according to claim 1, characterized in that: for the coordinates of the T feature points in Step 3, T = 68.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811272233.1A CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409296A true CN109409296A (en) | 2019-03-01 |
CN109409296B CN109409296B (en) | 2020-12-01 |
Family
ID=65470610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811272233.1A Expired - Fee Related CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409296B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731416A (en) * | 2005-08-04 | 2006-02-08 | 上海交通大学 | Method of quick and accurate human face feature point positioning |
CN105139004A (en) * | 2015-09-23 | 2015-12-09 | 河北工业大学 | Face expression identification method based on video sequences |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN107704810A (en) * | 2017-09-14 | 2018-02-16 | 南京理工大学 | A kind of expression recognition method suitable for medical treatment and nursing |
CN108682431A (en) * | 2018-05-09 | 2018-10-19 | 武汉理工大学 | A kind of speech-emotion recognition method in PAD three-dimensionals emotional space |
Non-Patent Citations (4)
Title |
---|
CHIEN SHING OOI ET AL: "A new approach of audio emotion recognition", 《ELSEVIER》 * |
TARUN KRISHNA ET AL: "Emotion recognition using facial and audio features", 《ICMI '13: PROCEEDINGS OF THE 15TH ACM ON INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION》 * |
LU GUANMING ET AL: "Micro-expression recognition based on LBP-TOP features", 《JOURNAL OF NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS (NATURAL SCIENCE EDITION)》 * |
HAN ZHIYAN: "Research on multimodal emotion recognition technology for speech and facial expression signals", 31 January 2017 * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961054A (en) * | 2019-03-29 | 2019-07-02 | 山东大学 | It is a kind of based on area-of-interest characteristic point movement anxiety, depression, angry facial expression recognition methods |
CN110363074A (en) * | 2019-06-03 | 2019-10-22 | 华南理工大学 | One kind identifying exchange method for complicated abstract class of things peopleization |
CN110414335A (en) * | 2019-06-20 | 2019-11-05 | 北京奇艺世纪科技有限公司 | Video frequency identifying method, device and computer readable storage medium |
CN110443143A (en) * | 2019-07-09 | 2019-11-12 | 武汉科技大学 | The remote sensing images scene classification method of multiple-limb convolutional neural networks fusion |
CN112308102A (en) * | 2019-08-01 | 2021-02-02 | 北京易真学思教育科技有限公司 | Image similarity calculation method, calculation device, and storage medium |
CN110677598A (en) * | 2019-09-18 | 2020-01-10 | 北京市商汤科技开发有限公司 | Video generation method and device, electronic equipment and computer storage medium |
CN110909613A (en) * | 2019-10-28 | 2020-03-24 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN110909613B (en) * | 2019-10-28 | 2024-05-31 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN111144197A (en) * | 2019-11-08 | 2020-05-12 | 宇龙计算机通信科技(深圳)有限公司 | Human identification method, device, storage medium and electronic equipment |
CN111178389B (en) * | 2019-12-06 | 2022-02-11 | 杭州电子科技大学 | Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling |
CN111178389A (en) * | 2019-12-06 | 2020-05-19 | 杭州电子科技大学 | Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling |
CN111553311A (en) * | 2020-05-13 | 2020-08-18 | 吉林工程技术师范学院 | Micro-expression recognition robot and control method thereof |
CN111798874A (en) * | 2020-06-24 | 2020-10-20 | 西北师范大学 | Voice emotion recognition method and system |
CN112101462A (en) * | 2020-09-16 | 2020-12-18 | 北京邮电大学 | Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN |
CN112101462B (en) * | 2020-09-16 | 2022-04-19 | 北京邮电大学 | Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN |
CN112418095A (en) * | 2020-11-24 | 2021-02-26 | 华中师范大学 | Facial expression recognition method and system combined with attention mechanism |
CN112418095B (en) * | 2020-11-24 | 2023-06-30 | 华中师范大学 | Facial expression recognition method and system combined with attention mechanism |
CN112488219A (en) * | 2020-12-07 | 2021-03-12 | 江苏科技大学 | Mood consolation method and system based on GRU and mobile terminal |
CN112766112A (en) * | 2021-01-08 | 2021-05-07 | 山东大学 | Dynamic expression recognition method and system based on space-time multi-feature fusion |
CN112766112B (en) * | 2021-01-08 | 2023-01-17 | 山东大学 | Dynamic expression recognition method and system based on space-time multi-feature fusion |
CN114005153A (en) * | 2021-02-01 | 2022-02-01 | 南京云思创智信息科技有限公司 | Real-time personalized micro-expression recognition method for face diversity |
CN112949560B (en) * | 2021-03-24 | 2022-05-24 | 四川大学华西医院 | Method for identifying continuous expression change of long video expression interval under two-channel feature fusion |
CN112949560A (en) * | 2021-03-24 | 2021-06-11 | 四川大学华西医院 | Method for identifying continuous expression change of long video expression interval under two-channel feature fusion |
CN113076847A (en) * | 2021-03-29 | 2021-07-06 | 济南大学 | Multi-mode emotion recognition method and system |
CN113065449A (en) * | 2021-03-29 | 2021-07-02 | 济南大学 | Face image acquisition method and device, computer equipment and storage medium |
CN113111789A (en) * | 2021-04-15 | 2021-07-13 | 山东大学 | Facial expression recognition method and system based on video stream |
CN113128399A (en) * | 2021-04-19 | 2021-07-16 | 重庆大学 | Speech image key frame extraction method for emotion recognition |
CN117577140A (en) * | 2024-01-16 | 2024-02-20 | 北京岷德生物科技有限公司 | Speech and facial expression data processing method and system for cerebral palsy children |
CN117577140B (en) * | 2024-01-16 | 2024-03-19 | 北京岷德生物科技有限公司 | Speech and facial expression data processing method and system for cerebral palsy children |
Also Published As
Publication number | Publication date |
---|---|
CN109409296B (en) | 2020-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109409296A (en) | The video feeling recognition methods that facial expression recognition and speech emotion recognition are merged | |
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108877801B (en) | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system | |
CN108805089B (en) | Multi-modal-based emotion recognition method | |
CN108805088B (en) | Physiological signal analysis subsystem based on multi-modal emotion recognition system | |
CN108717856B (en) | Speech emotion recognition method based on multi-scale deep convolution cyclic neural network | |
Datcu et al. | Semantic audiovisual data fusion for automatic emotion recognition | |
Chen et al. | K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human–robot interaction | |
CN109614895A (en) | A method of the multi-modal emotion recognition based on attention Fusion Features | |
Tawari et al. | Face expression recognition by cross modal data association | |
Yang et al. | Feature augmenting networks for improving depression severity estimation from speech signals | |
CN113158727A (en) | Bimodal fusion emotion recognition method based on video and voice information | |
Ocquaye et al. | Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition | |
CN106297825A (en) | A kind of speech-emotion recognition method based on integrated degree of depth belief network | |
Alshamsi et al. | Automated facial expression and speech emotion recognition app development on smart phones using cloud computing | |
Liang | Intelligent emotion evaluation method of classroom teaching based on expression recognition | |
Byun et al. | Human emotion recognition based on the weighted integration method using image sequences and acoustic features | |
Jaratrotkamjorn et al. | Bimodal emotion recognition using deep belief network | |
Huijuan et al. | Coarse-to-fine speech emotion recognition based on multi-task learning | |
Atkar et al. | Speech emotion recognition using dialogue emotion decoder and CNN Classifier | |
Veni et al. | Feature fusion in multimodal emotion recognition system for enhancement of human-machine interaction | |
Chelali | Bimodal fusion of visual and speech data for audiovisual speaker recognition in noisy environment | |
Li et al. | A novel art gesture recognition model based on two channel region-based convolution neural network for explainable human-computer interaction understanding | |
Datcu et al. | Multimodal recognition of emotions in car environments | |
Fu et al. | An adversarial training based speech emotion classifier with isolated gaussian regularization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20201201 |