CN109409296B - Video emotion recognition method integrating facial expression recognition and voice emotion recognition - Google Patents
- Publication number: CN109409296B (application CN201811272233.1A)
- Authority
- CN
- China
- Prior art keywords: recognition, formula, face, features, frame sequence
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06V40/174: Facial expression recognition
- G06V40/168: Feature extraction; face representation
- G06V40/172: Classification, e.g. identification
- G06F18/25: Pattern recognition; fusion techniques
- G10L25/57: Speech or voice analysis specially adapted for processing of video signals
- G10L25/63: Speech or voice analysis specially adapted for estimating an emotional state
Abstract
The invention relates to a video emotion recognition method that integrates facial expression recognition and speech emotion recognition. It concerns the processing of a record carrier for recognizing graphical patterns and is a decision-level audiovisual emotion recognition method with two progressive processes: facial expression recognition and speech emotion recognition are separated within a video, and speech emotion recognition is then performed on the basis of the facial expression recognition result by calculating a conditional probability. The method comprises the following steps: Process A, facial expression recognition as the first classification; Process B, speech emotion recognition as the second classification; and Process C, fusion of facial expression recognition and speech emotion recognition. The invention overcomes the shortcomings of the prior art, which ignores the intrinsic relationship between facial features and speech features in human emotion recognition and suffers from low recognition speed and low recognition rate in video emotion recognition.
Description
Technical Field
The technical solution of the invention relates to the processing of a record carrier for recognizing graphical patterns, and in particular to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology is advancing rapidly, and recognizing human emotion by computer has received extensive attention. How to enable a computer to recognize human emotion more quickly and accurately has therefore become a research hotspot in the field of machine vision.
Human emotions are expressed in various ways, mainly through facial expressions, speech, upper-body posture, and language text. Among these, facial expression and speech are the two most typical modes of emotional expression. Because facial texture and geometric features are easy to extract, emotion recognition based on facial expression can achieve a relatively high recognition rate in the current emotion recognition field. However, for expressions that look similar, such as anger and disgust, or fear and surprise, the texture and geometric features are close to one another, and the recognition rate obtained from facial expression features alone is not high.
Single-modal emotion recognition methods are often limited, so bimodal and multimodal emotion recognition has increasingly become a research hotspot in the emotion recognition field. The key to multimodal emotion recognition lies in the fusion scheme; the mainstream schemes are feature-level fusion and decision-level fusion.
In 2012, in "AVEC 2012: the continuous audio/visual emotion challenge", Schuller et al. cascaded audio and video features into a single feature vector and used support vector regression (SVR) as the baseline of the AVEC 2012 challenge. This feature-level fusion method directly concatenates multimodal features into a combined feature vector. Because the large number of multimodal features easily leads to the curse of dimensionality, the resulting high-dimensional features suffer from data sparsity; considering the interaction between features, the advantage of combining audio and video features through feature-level fusion is limited.
Decision-level fusion first models each emotion-expression mode with its own classifier and then fuses the recognition results of the classifiers; the different modes are combined according to the contribution of each emotional expression, without increasing the feature dimension. In the paper "A combined rule-based and machine learning audio-visual emotion recognition approach", Seng et al. split audiovisual emotion recognition into two mutually independent paths that extract features separately, model each on its own classifier to obtain the corresponding recognition rates, and finally obtain the overall recognition rate through a proportional scoring mechanism and a weight-distribution scheme. Existing decision-level fusion has two main disadvantages. First, the proportional scoring mechanism and the weight-distribution strategy lack a unified authoritative standard, so different researchers often obtain different recognition results on the same research task using different scoring mechanisms and weight distributions. Second, decision-level fusion focuses on fusing the face and speech recognition results while ignoring the intrinsic relationship between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method based on composite spatiotemporal features. It extends the existing volume local binary pattern algorithm to a spatiotemporal ternary pattern, acquires spatiotemporal local ternary pattern moment texture features of facial expression and upper-body posture, further fuses three-dimensional gradient-orientation histogram features to enhance the description of the emotion video, and combines the two kinds of features into a composite spatiotemporal feature.
CN105512609A discloses a multimodal-fusion video emotion recognition method based on a kernel extreme learning machine. It performs feature extraction and feature selection on the image and audio information of a video to obtain video features; preprocesses the collected multi-channel EEG signals and performs feature extraction and selection on them to obtain EEG features; establishes a multimodal-fusion video emotion recognition model based on the kernel extreme learning machine; and inputs the video features and EEG features into that model to perform video emotion recognition and obtain the final classification accuracy. However, the algorithm achieves a high classification rate only on three classes of video emotion data, which limits its usability.
CN103400145B discloses an audio-visual fusion emotion recognition method based on a cue neural network. It first trains an independent neural network on the feature data of each of three channels (frontal facial expression, profile facial expression, and speech) to recognize discrete emotion categories; during training, 4 cue nodes are added to the output layer of each neural network model to carry the cue information of 4 coarse-grained categories in the activation-evaluation space, and a multimodal fusion model, itself a neural network trained on the cue information, then fuses the outputs of the three networks. However, in most videos the number of profile-expression frames is small and hard to collect effectively, which greatly limits the method in practice. The method also involves neural network training and fusion; as the data volume and data dimension grow, the consumption of training time and resources gradually increases, and the error rate rises as well.
CN105138991B discloses a video emotion recognition method based on the fusion of emotion-significant features. It extracts audio features and visual emotion features from each video shot of a training video set; the audio features form an emotion-distribution histogram feature based on a bag-of-words model, the visual emotion features form an emotion-attention feature based on a visual dictionary, and the two are fused top-down into video features with emotion significance. However, when extracting the visual emotion features the method uses only the video key frames, ignoring to some extent the relationships between features of successive video frames.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video emotion recognition method that fuses facial expression recognition and speech emotion recognition, namely a decision-level audiovisual emotion recognition method with two progressive processes.
The technical solution adopted by the invention to solve this problem is as follows: a video emotion recognition method fusing facial expression recognition and speech emotion recognition, which is a decision-level audiovisual emotion recognition method with two progressive processes, comprising the following steps:
Process A. Facial expression recognition as the first classification:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
using the published Viola-Jones algorithm, locate and crop the face in the image frame sequence obtained in the first step, and normalize each cropped face image to M x M pixels, obtaining an image frame sequence with normalized face-image size;
perform voice activity detection on the voice signal obtained in the first step with the known voice endpoint detection algorithm VAD and remove noise and silence segments, obtaining a voice signal whose features are easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
mark T facial feature points (T = 68) on the image frame sequence with normalized face-image size from the second step. The positions of the 68 feature points are known, and the marked feature points outline the eye, eyebrow, nose, and mouth regions of the face image. For the u-th frame image of the sequence, calculate the following 6 specific distances from the coordinates of the T feature points:
the distance between the eyes and the eyebrows in the vertical direction is Du,1:Du,1=dvertical||p22,p40||,
The distance in the vertical direction of the opening of the eye is Du,2:Du,2=dvertical||p45,p47||,
The distance between the eyes and the mouth in the vertical direction is Du,3:Du,3=dvertical||p37,p49||,
The distance D in the vertical direction between the nose and the mouthu,4:Du,4=dvertical||p34,p52||,
The distance in the vertical direction of the upper and lower lips is Du,5:Du,5=dvertical||p52,p58||,
The two sides of the mouth have a width distance D in the horizontal directionu,6:Du,6=dhorizontal||p49,p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In the formula (1), piIs the coordinate set of the ith feature point, pjIs the coordinate set of the jth feature point, pi,yIs the ordinate, p, of the i-th feature pointj,yIs the ordinate of the jth feature point, pi,xIs the abscissa, p, of the ith feature pointj,xIs the abscissa of the jth feature point, dvertical||pi,pjI is the vertical distance between feature points i and j, dhorizontal||pi,pjI | is the horizontal distance between feature points i and j, i ═ 1,2, …,68, j ═ 1,2, …, 68;
set the first frame of the image frame sequence with normalized face-image size from the second step as the neutral frame; its set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame in the image frame sequence with normalized face-image size from the second step;
the set V_u of 6 specific distances of the u-th frame in the image frame sequence with normalized face-image size from the second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
In formula (3), u = 1, 2, ..., K - 1, where K is the number of face images in the group of image frame sequences with normalized face-image size from the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances of the u-th frame;
the sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized face-image size from the second step is shown in formula (4),
DF_u = Σ_{n=1}^{6} D_{u,n} / D_{0,n}   (4),
In formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame;
for the image frame sequence with normalized face-image size from the second step, obtain the specific-distance ratio sum DF_u of each frame according to formulas (2), (3) and (4), and select the u-th frame with the maximum DF_u as the key frame of the image frame sequence.
This completes the marking of the facial feature points and the screening of the key frame in the image frame sequence;
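The key-frame screening above can be sketched as follows; the distance values are hypothetical stand-ins for measurements computed from the 68 marked feature points:

```python
# Key-frame screening (third step): for each frame u, compute DF_u, the sum of
# the ratios of its 6 specific distances to those of the neutral frame (frame 0),
# per formula (4), and keep the frame with the maximum DF_u as the key frame.

def screen_key_frame(distance_sets):
    """distance_sets[u] is the list of 6 specific distances V_u; index 0 is the neutral frame."""
    v0 = distance_sets[0]
    best_u, best_df = 1, float("-inf")
    for u in range(1, len(distance_sets)):
        df_u = sum(d_un / d_0n for d_un, d_0n in zip(distance_sets[u], v0))  # formula (4)
        if df_u > best_df:
            best_u, best_df = u, df_u
    return best_u, best_df

frames = [
    [10.0, 8.0, 40.0, 20.0, 5.0, 30.0],   # neutral frame V_0 (hypothetical values)
    [10.5, 8.2, 40.5, 20.2, 6.0, 31.0],   # frame 1: slight deviation
    [12.0, 9.5, 42.0, 21.0, 9.0, 36.0],   # frame 2: expression peak
]
key, df = screen_key_frame(frames)
print(key)  # frame 2 deviates most from the neutral frame
```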
fourthly, extracting the texture features of the human face:
extract face texture features with the LBP-TOP algorithm. First, divide the image frame sequence with normalized face-image size from the second step into the XY, XT and YT orthogonal planes in space-time, compute the LBP value of the central pixel of each 3 x 3 neighborhood in every orthogonal plane, accumulate the LBP histogram of each of the three orthogonal planes, and finally concatenate the three LBP histograms into an overall feature vector. The LBP operator is computed as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q   (5),
Sig(t_q - t_c) = 1 if t_q - t_c ≥ 0, and 0 otherwise   (6),
In formula (5) and formula (6), Z is the number of neighborhood points around the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point.
The LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a},  a = 0, ..., n_b - 1,  b = 0, 1, 2   (7),
In formula (7), b is the plane index (b = 0 for the XY plane, b = 1 for the XT plane, b = 2 for the YT plane), n_b is the number of binary patterns produced by the LBP operator on the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP coding value equals a when the LBP_{Z,R} operator performs feature extraction on the b-th plane;
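As an illustration of the LBP operator of formulas (5) and (6), the code of the central pixel of one 3 x 3 neighborhood can be computed as below; the plane handling and histogram accumulation of the full LBP-TOP are omitted, and the pixel values are invented:

```python
# Basic LBP operator (formulas (5), (6)) on a 3x3 neighborhood: threshold the
# 8 neighbors against the central pixel and weight the results by powers of 2.

def lbp_code(neigh):
    """neigh is a 3x3 list of pixel values; returns the LBP code of the center."""
    tc = neigh[1][1]
    # the 8 neighbors in a fixed circular order starting at the top-left corner
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for q, (r, c) in enumerate(order):
        sig = 1 if neigh[r][c] >= tc else 0   # Sig(t_q - t_c), formula (6)
        code += sig * (2 ** q)                # weighted sum, formula (5)
    return code

patch = [
    [52, 60, 58],
    [49, 50, 55],
    [40, 44, 51],
]
print(lbp_code(patch))  # -> 31
```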
fifthly, extracting geometric features of the human face:
using the key frame of the screened image frame sequence obtained in the third step, compute the geometric features of the facial expression from the coordinates of the T feature points marked in the key frame. In the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, consisting mainly of the eyebrow, eye, nose, chin and mouth regions, so the geometric feature extraction method mainly extracts distance features between the marked points of this T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
from the T feature points of the key frame obtained in the third step, select 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs in total. Compute the Euclidean distance between each feature-point pair A and B, giving a 50-dimensional Euclidean distance feature denoted G_50. The Euclidean distance between feature points A and B is computed by formula (8),
d(A, B) = sqrt((p_{A,x} - p_{B,x})² + (p_{A,y} - p_{B,y})²)   (8),
In formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} and p_{A,y} are the abscissa and ordinate of feature point A, and p_{B,x} and p_{B,y} are the abscissa and ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
from the T feature points of the key frame screened in the third step, select 10 angles representing facial feature changes: 2 eyebrow angles, 6 eye angles and 2 mouth angles. Extract the angle features, 10 dimensions in total, denoted Q_10. The angle at a feature point is computed by formula (9),
θ = arccos( ((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||) )   (9),
In formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming the angle in the eyebrow, eye or mouth region marked in the third step, with p_D the vertex;
and 5.3, calculating the area characteristics of the face region:
select 5 regions of the face image: the left and right eyebrows, the two eyes, and the mouth. Compute the area of each of the 5 regions, and subtract the areas of the 5 face regions extracted from the neutral frame from the corresponding areas extracted from the key frame; because the sizes of facial organs differ from person to person, this yields the change features of the face-region areas, 5 dimensions in total, denoted O_5. The eyebrow, mouth and eye regions of the face are approximated as triangles, and the area of each triangle is computed with Heron's formula. The Euclidean distance features G_50 of the feature-point pairs, the angle features Q_10 of the feature points and the face-region area features O_5 are combined into the facial geometric feature F as shown in formula (10),
F = [G_50  Q_10  O_5]   (10),
The facial texture features and the facial geometric features are then concatenated, which completes the extraction of the facial expression features;
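The three geometric measurements of the fifth step, the pairwise Euclidean distance of formula (8), the vertex angle of formula (9), and the triangle area via Heron's formula, reduce to elementary plane geometry. A compact sketch with hypothetical feature-point coordinates:

```python
import math

# Geometric feature primitives of the fifth step: Euclidean distance between a
# feature-point pair (formula (8)), angle at a vertex feature point
# (formula (9)), and triangle area via Heron's formula for the region features.

def euclid(pa, pb):                        # formula (8)
    return math.hypot(pa[0] - pb[0], pa[1] - pb[1])

def angle(pc, pd, pe):                     # formula (9), vertex at pd
    v1 = (pc[0] - pd[0], pc[1] - pd[1])
    v2 = (pe[0] - pd[0], pe[1] - pd[1])
    cosang = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def heron_area(pa, pb, pc):                # triangle area for the region features
    a, b, c = euclid(pb, pc), euclid(pa, pc), euclid(pa, pb)
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

# Hypothetical coordinates, standing in for marked feature points:
print(euclid((0, 0), (3, 4)))              # -> 5.0
print(angle((1, 0), (0, 0), (0, 1)))       # -> 90.0 degrees
print(heron_area((0, 0), (4, 0), (0, 3)))  # -> 6.0
```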
sixthly, grouping the facial expressions:
the six facial emotions are: surprise, fear, anger, disgust, happiness and sadness. They are divided pairwise into the following three groups:
First group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
put the facial expression features extracted in the fourth and fifth steps into an ELM classifier for training and testing, completing the first classification of facial expression recognition and obtaining its recognition result, with the ELM parameters set as follows: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: "Sigmoid";
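An ELM trains in closed form: the hidden layer is random and fixed, and only the output weights are solved with a pseudo-inverse. A minimal sketch using the parameters stated above (20 sigmoid hidden neurons); the toy two-class data are invented stand-ins for the expression features:

```python
import numpy as np

# Minimal Extreme Learning Machine (ELM) classifier matching the seventh-step
# parameters: 20 hidden neurons, Sigmoid activation. The hidden weights are
# random and never trained; the output weights come from a Moore-Penrose
# pseudo-inverse.

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=20, n_classes=2):
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # Sigmoid hidden-layer output
    T = np.eye(n_classes)[y]                      # one-hot targets
    beta = np.linalg.pinv(H) @ T                  # closed-form output weights
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

# Toy two-class data (invented), standing in for the expression features:
X = np.vstack([rng.normal(-2, 0.5, (30, 4)), rng.normal(2, 0.5, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
model = elm_train(X, y)
acc = np.mean(elm_predict(X, model) == y)
print(acc)
```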
Process B. Speech emotion recognition as the second classification:
in process B, based on the facial expression recognition result of process A, speech features are combined, and speech emotion feature extraction and the second classification of speech emotion recognition are carried out for each of the three groups of the sixth-step facial expression grouping, as follows:
eighthly, extracting the speech emotion features:
for the result of the first classification of facial expression recognition in the seventh step, and according to the grouping of the sixth step, extract different prosodic features for each group according to the sensitivity of each group's emotions to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch of the prosodic features above is calculated in the frequency domain. For the speech signal preprocessed in the second step, the Pitch is calculated by formula (11),
Pitch = argmax_k |DFT(x̃(m), L_M)|   (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x̃(m) represents the speech signal multiplied by a Hamming window; x̃(m) is calculated as shown in formula (12),
x̃(m) = x(m) · [0.54 - 0.46 cos(2πm / (N - 1))],  0 ≤ m ≤ N - 1   (12),
In formula (12), N is the length of the Hamming window and m is the m-th sample point;
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) Σ_{m=1}^{N} |sgn{x(m)} - sgn{x(m-1)}|   (13),
In formula (13), ZCR represents the average zero-crossing rate of the N windows, | · | is the absolute-value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the sgn{x(m)} function gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1, x(m) ≥ 0;  sgn{x(m)} = -1, x(m) < 0   (14),
In formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE above is calculated as in formula (15),
LogE = log( Σ_{m=1}^{N} x²(m) )   (15),
In formula (15), LogE represents the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m)·X″(m) (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm, and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = A cos(Φm + θ), where A is the signal amplitude, Φ is the signal frequency, and θ is the initial phase angle of the signal,
extracting the well-known Mel frequency cepstrum coefficients (MFCC), together with their first-order and second-order difference features, from the audio files corresponding to the image frames of each of the three groups of the sixth-step facial expression grouping, and finally concatenating each group's extracted prosodic features with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as follows: penalty coefficient: "95", allowed redundant outputs: "0", kernel parameter: "1", support vector machine kernel function: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
since the speech emotion recognition in process B is a secondary recognition performed on the basis of the facial emotion recognition in process A, the relationship between the two recognition rates is a conditional probability, and the final recognition rate P(Audio_Visual) is calculated by formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P(Visual) is the recognition rate of the first classification on the face images, and P(Audio|Visual) is the recognition rate of the second classification on speech emotion;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
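The decision-level fusion of formula (17) is direct to compute; a one-line Python sketch (the example values are illustrative, not from the patent):

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    # Formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)
    return p_visual * p_audio_given_visual
```

For example, a first-stage grouping accuracy of 0.90 combined with a second-stage within-group accuracy of 0.85 yields a fused rate of 0.765.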
In the video emotion recognition method combining facial expression recognition and speech emotion recognition, in the third step, the coordinates of the T feature points are determined, where T is 68.
In the video emotion recognition method combining facial expression recognition and voice emotion recognition, the voice endpoint detection algorithm is called Voice Activity Detection in English, abbreviated as VAD; the English of zero-crossing rate is Zero-Crossing Rate, abbreviated as ZCR; the English of logarithmic energy is LogEnergy, abbreviated as LogE; the English of Mel frequency cepstral coefficients is Mel-frequency cepstral coefficients, abbreviated as MFCC; the English of the Teager energy operator is Teager Energy Operator, abbreviated as TEO. The speech endpoint detection algorithm, zero-crossing rate, logarithmic energy, Mel frequency cepstrum coefficients and the Teager energy operator are all well known in the technical field.
The above video emotion recognition method combining facial expression recognition and speech emotion recognition is a calculation operation method that can be grasped by those skilled in the art.
The invention has the beneficial effects that: compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) The invention provides a video emotion recognition method fusing facial expression recognition and voice emotion recognition, an audiovisual emotion recognition method at the decision level. The method separates the facial expression recognition and the voice emotion recognition in the video, adopts a two-process progressive emotion recognition method, and uses a conditional probability calculation. Because the speech emotion recognition is performed on the basis of the facial expression recognition, the influence of the facial expression recognition result on the speech emotion recognition is fully considered; the two recognitions are fused more closely and assist each other, so a more ideal human emotion recognition effect is obtained. This overcomes the defects of the prior art, in which the internal relation between face features and voice features is ignored in human emotion recognition, and video emotion recognition has low recognition speed and low recognition rate.
(2) Chien et al. performed an "audio feature analysis" experiment in a "new approach of audio observation recognition" in 2014, demonstrating that the 6 emotions have different degrees of sensitivity to the prosodic features Pitch, Zero-Crossing Rate, LogEnergy and Teager Energy Operator. That paper classifies Mel-Frequency Cepstral Coefficients (MFCC) extracted from audio with an SVM classifier; the recognition rate decreases progressively from two-class to four-class to six-class classification, i.e. the fewer the classes, the better the classifier performs. Therefore, in the invention, three classes are used in the first facial expression classification and two classes in the second audio classification. The method thereby simplifies the multi-class problem into a three-class problem and a two-class problem, reducing the feature dimension, shortening the training time, and greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the invention extracts not only the face features in the video but also the audio features; this bimodal combination of face and audio features helps identify the emotion of the person in the video more accurately.
(4) Compared with CN105512609A, the method provided in CN105512609A can only identify three emotions in a video, but the method can identify six emotions in the video, and the average identification rate of the method is 9.92% higher than that of the video in CN 105512609A.
(5) Compared with CN105138991A, the method of the invention classifies the face features and the voice features separately, avoiding the "curse of dimensionality" easily caused by feature-level fusion; the decision-level fusion method is simple to operate, and training and recognition are faster.
(6) When the audio features are extracted, the different sensibilities of different emotions to different audio features are considered, so that different audio features are extracted from each group, and the second classification based on the voice features is facilitated.
(7) The method extracts texture, geometry, time and rhythm characteristics, different characteristics reflect different characteristics of expressions, and a classifier can be well trained to perform video emotion recognition from multiple modes.
(8) The invention applies a two-stage progressive emotion classification method, with face recognition as the primary modality and voice recognition as the auxiliary modality; the two complement and assist each other, achieving more accurate video emotion recognition.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a block diagram showing the flow of the method of the present invention.
Fig. 2 is a schematic diagram of the labeling of 6 specific distances and 68 feature points of a human face.
Fig. 3 is a diagram of an example of 68 feature point labels for a face in the eNTERFACE' 05 database.
Detailed Description
The embodiment shown in fig. 1 shows that the process of the method of the present invention comprises process A, process B and process C.
The embodiment shown in fig. 2 is an exemplary image labeled with the feature points; it shows the labels of the 68 feature points of the human face and the 6 specific distances: the vertical distance between feature points 22 and 40 is denoted Du,1, the vertical distance between feature points 45 and 47 is denoted Du,2, the vertical distance between feature points 37 and 49 is denoted Du,3, the vertical distance between feature points 34 and 52 is denoted Du,4, the vertical distance between feature points 52 and 58 is denoted Du,5, and the horizontal distance between feature points 49 and 55 is denoted Du,6. The connecting lines between the feature points outline the eyebrow, eye and mouth regions of the human face.
The embodiment shown in fig. 3 shows an example of labeling a face in the eNTERFACE' 05 database with Dlib feature points, where 68 feature points labeled in the figure correspond to the labels of the 68 feature points shown in the schematic diagram of labeling the face feature points in fig. 2.
Example 1
The video emotion recognition method fusing facial expression recognition and voice emotion recognition in the embodiment is a decision-level-based two-process progressive audiovisual emotion recognition method, and specifically comprises the following steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
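The patent relies on a known VAD algorithm for the speech preprocessing; as a rough stand-in (not the patent's method; the frame length and threshold ratio below are arbitrary assumptions), a short-time-energy gate can drop silent frames:

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold_ratio=0.1):
    """Minimal energy-threshold voice activity stand-in: keep frames whose
    short-time energy exceeds a fraction of the peak frame energy."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    keep = energy > threshold_ratio * energy.max()
    return frames[keep].ravel()                 # concatenate the voiced frames
```

Production systems would instead use a proper VAD with noise estimation and hangover smoothing; this sketch only illustrates the silence-removal step.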
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T feature points on the image frame sequence with normalized face image size from the second step, where T = 68; the positions of the 68 feature points are known, and the marked feature point outlines lie in the eye, eyebrow, nose and mouth regions of the face image. In this embodiment, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized face image size in the second step, according to the coordinates of the T = 68 feature points:
the vertical distance between the eyes and the eyebrows is Du,1: Du,1 = dvertical||p22, p40||,
the vertical opening distance of the eye is Du,2: Du,2 = dvertical||p45, p47||,
the vertical distance between the eyes and the mouth is Du,3: Du,3 = dvertical||p37, p49||,
the vertical distance between the nose and the mouth is Du,4: Du,4 = dvertical||p34, p52||,
the vertical distance between the upper and lower lips is Du,5: Du,5 = dvertical||p52, p58||,
the horizontal width distance between the two mouth corners is Du,6: Du,6 = dhorizontal||p49, p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In formula (1), pi is the coordinate pair of the i-th feature point, pj is the coordinate pair of the j-th feature point, pi,y is the ordinate of the i-th feature point, pj,y is the ordinate of the j-th feature point, pi,x is the abscissa of the i-th feature point, pj,x is the abscissa of the j-th feature point, dvertical||pi, pj|| is the vertical distance between feature points i and j, dhorizontal||pi, pj|| is the horizontal distance between feature points i and j, i = 1, 2, …, 68, j = 1, 2, …, 68;
setting the first frame in the image frame sequence with normalized human face image size in the second step as a neutral frame, and the set V of 6 specific distances0As shown in the formula (2),
V0=[D0,1,D0,2,D0,3,D0,4,D0,5,D0,6] (2),
in formula (2), D0,1, D0,2, D0,3, D0,4, D0,5 and D0,6 are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized face image size in the second step;
the set V of 6 specific distances of the u frame in the image frame sequence with normalized face image size in the second stepuAs shown in the formula (3),
Vu=[Du,1,Du,2,Du,3,Du,4,Du,5,Du,6] (3),
in formula (3), u = 1, 2, …, K−1, where K is the number of face images in the group of image frame sequences with normalized face image size in the second step, and Du,1, Du,2, Du,3, Du,4, Du,5, Du,6 are the 6 specific distances corresponding to the u-th frame in the image frame sequence with normalized face image size in the second step;
the sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized face image size in the second step is shown in formula (4),
DFu = Σn=1…6 (Du,n / D0,n) (4),
in formula (4), DFu represents the sum of the ratios of the 6 specific distances between the neutral frame image and the u-th frame image in the image frame sequence with normalized face image size in the second step, n indexes the 6 specific distances, D0,n represents the n-th specific distance corresponding to the neutral frame in the image frame sequence with normalized face image size in the second step, and Du,n represents the n-th specific distance corresponding to the u-th frame in the image frame sequence with normalized face image size in the second step;
in the image frame sequence with normalized face image size in the second step, the ratio sum DFu corresponding to each image frame is obtained according to formula (2), formula (3) and formula (4); the u-th frame image with the maximum DFu is screened out as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
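Assuming formula (4) is the sum of ratios Du,n/D0,n as described, key-frame screening reduces to an argmax over frames; a numpy sketch:

```python
import numpy as np

def key_frame_index(distances):
    """distances: (K, 6) array; row 0 holds the neutral-frame distances V0
    (formula (2)), rows u >= 1 hold Vu (formula (3)). Returns the index u
    maximising DFu, the sum of ratios D_{u,n}/D_{0,n} (formula (4))."""
    d0 = distances[0]
    df = (distances[1:] / d0).sum(axis=1)   # DFu for u = 1 .. K-1
    return int(np.argmax(df)) + 1           # +1: df[0] corresponds to frame 1
```

The frame whose eye, eyebrow and mouth distances deviate most from the neutral frame (largest DFu) is taken as the key frame.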
fourthly, extracting the texture features of the human face:
extracting face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the central pixel of each 3×3 neighborhood is calculated in each orthogonal plane, the LBP histogram features of the three orthogonal planes are counted, and finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z−1} Sig(t_q − t_c) · 2^q (5),
Sig(t_q − t_c) = 1, t_q − t_c ≥ 0; Sig(t_q − t_c) = 0, t_q − t_c < 0 (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q − t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a}, a = 0, …, n_b − 1, b = 0, 1, 2 (7),
in formula (7), b is the number of the plane (b = 0 for the XY plane, b = 1 for the XT plane, b = 2 for the YT plane), n_b is the number of binary patterns produced by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the number of pixel points whose LBP coding value is a when feature extraction is performed with the LBP_{Z,R} operator in the b-th plane;
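A small sketch of the basic 3×3 LBP operator of formulas (5)-(6) and the per-plane histogram that forms one term of the LBP-TOP concatenation in formula (7); the neighbour ordering below is an arbitrary convention:

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the centre pixel of a 3x3 patch (Z = 8, R = 1):
    threshold the neighbours against the centre, weight by 2^q."""
    center = patch[1, 1]
    # Eight neighbours, ordered clockwise from the top-left corner
    neighbours = patch[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
    sig = (neighbours >= center).astype(int)          # formula (6)
    return int((sig * 2 ** np.arange(8)).sum())       # formula (5)

def lbp_histogram(plane):
    """Histogram of LBP codes over one orthogonal plane (XY, XT or YT)."""
    h = np.zeros(256, dtype=int)
    H, W = plane.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            h[lbp_code(plane[y - 1:y + 2, x - 1:x + 2])] += 1
    return h
```

LBP-TOP then concatenates the three per-plane histograms into one feature vector.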
fifthly, extracting geometric features of the human face:
according to the key frames in the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame are used to obtain the geometric features of the facial expression; in the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, with the specific feature points shown in Table 2, so the extraction method of the geometric features of the face mainly extracts distance features between the marked points of the facial T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frames in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs of feature points in total; the Euclidean distance between the feature points of each pair A and B is calculated, giving 50-dimensional Euclidean distance features, denoted G50. Formula (8) for calculating the Euclidean distance between feature points A and B is as follows,
d||pA, pB|| = √((pB,x − pA,x)² + (pB,y − pA,y)²) (8),
in formula (8), pA is the coordinate pair of feature point A, pB is the coordinate pair of feature point B, pA,x is the abscissa of feature point A, pA,y is the ordinate of feature point A, pB,x is the abscissa of feature point B, and pB,y is the ordinate of feature point B;
table 1 shows the pairs of face feature points to be calculated in the face T-shaped region, where d | | pA,pB| | represents the euclidean distance between pairs of feature points A, B;
TABLE 1
And 5.2, calculating the angle characteristics of the face characteristic points:
selecting 10 angles representing facial feature changes from the 68 feature points of the key frame screened in the third step, among them 2 eyebrow angles, 6 eye angles and 2 mouth angles, and extracting the angle features, 10 dimensions in total, denoted Q10; the specific angles are shown in Table 2, and formula (9) for calculating the angle of a feature point is as follows,
Q(pC, pD, pE) = arccos( ((pC − pD)·(pE − pD)) / (||pC − pD|| · ||pE − pD||) ) (9),
in formula (9), pC, pD and pE are the coordinate pairs of the three feature points forming the angle in the eyebrow, eye or mouth region marked in the third step, with pD the vertex coordinate pair;
Table 2 shows the face feature point angles to be calculated in the facial T-shaped region, where Q(pC, pD, pE) represents the angle feature at vertex D;
TABLE 2
And 5.3, calculating the area characteristics of the face region:
selecting 5 areas of the face image, including left and right eyebrows, two eyes and a mouth, and respectively calculating the area characteristics of the 5 areas, wherein the specific area areas are shown in a table 3;
TABLE 3
Table 3 shows the areas of the regions enclosed by the face feature points to be calculated in the facial T-shaped region, where O(pA, pB, pC, pD) represents the area of the region enclosed by the lines connecting feature points A, B, C and D;
because the sizes of facial organs differ from person to person, the areas of the 5 face regions of Table 3 extracted from the key frame are correspondingly subtracted from the areas of the same 5 regions extracted from the neutral frame, giving the change features of the face region areas, 5 dimensions in total, denoted O5; the eyebrow, mouth and eye regions of the face are treated as triangles, and the area of each triangle is calculated by Heron's formula. The Euclidean distance features G50 of the feature point pairs, the angle features Q10 of the feature points and the area features O5 of the face regions are combined into the geometric feature F of the face, as shown in formula (10),
F=[G50 Q10 O5] (10),
at this point, the facial texture features and the facial geometric features are connected in series to complete the extraction of the facial expression features;
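The three geometric feature families combined into F in formula (10) can be sketched as follows; the arccos form of the angle and the use of Heron's formula for the triangular regions follow the description above, while the exact point selections remain those of Tables 1-3:

```python
import numpy as np

def euclid(pa, pb):
    # Formula (8): Euclidean distance between feature points A and B
    return float(np.hypot(pb[0] - pa[0], pb[1] - pa[1]))

def angle_at(pc, pd, pe):
    # Formula (9) as reconstructed: angle (degrees) at vertex D
    # between the rays D->C and D->E
    v1, v2 = np.subtract(pc, pd), np.subtract(pe, pd)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def heron_area(pa, pb, pc):
    # Heron's formula for a triangular face region (eyebrow, eye, mouth)
    a, b, c = euclid(pb, pc), euclid(pa, pc), euclid(pa, pb)
    s = (a + b + c) / 2
    return float(np.sqrt(s * (s - a) * (s - b) * (s - c)))
```

Evaluating these over the 50 point pairs, 10 angles and 5 regions, and concatenating the results, yields the 65-dimensional geometric feature F = [G50 Q10 O5] of formula (10).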
sixthly, grouping the facial expressions:
the six facial emotions are: surprise, fear, anger, disgust, happiness and sadness; they are divided pairwise into three groups, as follows:
first group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
putting the facial expression features extracted in the fourth step and the fifth step into an ELM classifier for training and testing, thereby finishing the first classification of facial expression recognition and obtaining the recognition result of the first classification of the facial expression recognition, wherein the parameters of the ELM are set as: ELM type: "classification", number of hidden layer neurons: "20", activation function: a "Sigmoid" function;
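The patent specifies an ELM with 20 Sigmoid hidden neurons. A minimal numpy sketch of extreme learning machine training (random hidden-layer weights, output weights by Moore-Penrose pseudo-inverse); the function names and toy data are illustrative, not the patent's implementation:

```python
import numpy as np

def train_elm(X, y, n_hidden=20, seed=0):
    """Minimal ELM: random input weights and biases, Sigmoid hidden layer,
    output weights solved in closed form with the pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # Sigmoid hidden activations
    T = np.eye(int(y.max()) + 1)[y]          # one-hot class targets
    beta = np.linalg.pinv(H) @ T             # least-squares output weights
    return W, b, beta

def predict_elm(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)
```

Unlike backpropagation-trained networks, only the output weights beta are learned, which keeps training fast; this is the property the patent exploits for the first-stage three-class grouping.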
and B, taking the speech emotion recognition as a second classification recognition:
in the process B, based on the facial expression recognition result in the process a, the speech features are combined, and the speech emotion feature extraction and the second classification of the speech emotion recognition are performed on each of the three groups in the sixth facial expression grouping, specifically, the operations are as follows:
and eighth, extracting the speech emotion characteristics:
and aiming at the classification result of the first classification of the facial expression recognition in the seventh step, according to the grouping in the sixth step, different prosodic features are respectively extracted according to the different sensitivity degrees of the emotions of each group to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch height Pitch of the prosodic feature described above is calculated in the frequency domain,
for the speech signal M preprocessed in the second step above, the pitch Pitch is calculated by formula (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x̃(m) represents the speech signal after a Hamming window is applied; x̃(m) is calculated as shown in formula (12),
in formula (12), N is the number of Hamming windows and m indexes the m-th Hamming window;
the zero crossing rate ZCR in the prosodic feature is calculated as shown in equation (13),
in formula (13), ZCR represents the average zero-crossing rate of the N windows, |·| is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the sgn{x(m)} function gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1, x(m) ≥ 0; sgn{x(m)} = −1, x(m) < 0 (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the above log energy LogE is calculated as in equation (15),
in formula (15), LogE represents the total logarithmic energy of N windows, x (m) is the speech signal of the mth window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m)·X″(m) (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm, and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = A cos(Φm + θ), where A is the signal amplitude, Φ is the signal frequency, and θ is the initial phase angle of the signal,
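In discrete form the Teager energy operator of formula (16) is usually written ψ[x(m)] = x(m)² − x(m+1)·x(m−1); a numpy sketch of it, together with a per-window log energy in the spirit of formula (15) (the small ε guard against log 0 is an addition of this sketch):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x(m)] = x(m)^2 - x(m+1) * x(m-1), for m = 1 .. len(x)-2."""
    return x[1:-1] ** 2 - x[2:] * x[:-2]

def log_energy(frames, eps=1e-12):
    """Total log energy over N windows (in the spirit of formula (15)).
    frames: (N, L) array of framed, windowed speech."""
    return float(np.log((frames ** 2).sum(axis=1) + eps).sum())
```

For a pure tone A·cos(Ωm), the discrete TEO is the constant A²·sin²Ω, which is why it tracks both amplitude and frequency of the speech signal.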
extracting the well-known Mel frequency cepstrum coefficients (MFCC), together with their first-order and second-order difference features, from the audio files corresponding to the image frames of each of the three groups of the sixth-step facial expression grouping, and finally concatenating each group's extracted prosodic features with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
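The MFCC first- and second-order difference features are obtained by differencing the coefficient matrix along time; a minimal finite-difference stand-in (real MFCC toolchains typically use a regression window instead of a plain diff):

```python
import numpy as np

def delta(features):
    """First-order difference along the time axis, padded so the output
    keeps the input shape; applying it twice gives the second-order
    difference features."""
    d = np.diff(features, axis=0)
    return np.vstack([d[:1], d])  # repeat first row to preserve the shape
```

The mixed audio feature of a group is then the concatenation of its prosodic features with the MFCC matrix, delta(mfcc) and delta(delta(mfcc)).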
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as follows: penalty coefficient: "95", allowed redundant outputs: "0", kernel parameter: "1", support vector machine kernel function: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
because the speech emotion recognition is a secondary recognition performed on the basis of the facial emotion recognition, the relationship between the two recognition rates is a conditional probability; the final recognition rate P(Audio_Visual) is calculated by formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P (Visual) is the recognition rate of the first face image recognition, and P (Audio | Visual) is the recognition rate of the second speech emotion;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
In this example, comparison experiments with the prior art were carried out on the eNTERFACE'05 and RML databases; the specific recognition rates are shown in Table 4 below:
TABLE 4
The experimental results in Table 4 compare the recognition rates of audiovisual emotion recognition systems of recent years on the eNTERFACE'05 and RML databases: the average recognition rate of audiovisual emotion recognition on the eNTERFACE'05 database in the "Audio recording recognition method and multi-classifier neural networks" document by Mahdi Bejani et al., 2014, was 77.78%;
the average recognition rate of audiovisual emotion recognition on the RML database in the "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition" document by Shiqing Zhang et al., 2016, was 74.32%;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition" document by Shiqing Zhang et al., 2017, were 85.97% and 80.36%, respectively;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach" document by Yaxiong Ma et al., 2018, were 84.56% and 81.98%, respectively. The decision-level-based two-process progressive audiovisual emotion recognition method adopted in this embodiment achieves a relatively large improvement in recognition rate compared with papers of recent years.
In this embodiment, the english of the Voice endpoint Detection algorithm is Voice Activity Detection, abbreviated as VAD, and the english of logarithmic energy is LogEnergy, abbreviated as LogE; the English of Zero-Crossing Rate is Zero-Crossing Rate, abbreviated as ZCR; the English language of the Teager Energy Operator is Teager Energy Operator, abbreviated as TEO; the Mel frequency cepstral coefficients are abbreviated as MFCC, and the speech endpoint detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and Mel frequency cepstrum coefficients are all known in the technical field.
In the present embodiment, the calculation operation method is understandable to those skilled in the art.
Claims (1)
1. The video emotion recognition method fusing facial expression recognition and voice emotion recognition is characterized by comprising the following steps of: the decision-level-based two-process progressive audiovisual emotion recognition method comprises the following specific steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T face feature points on the image frame sequence with normalized face image size from the second step, where T = 68; the positions of the 68 feature points are known, and the marked feature point outlines lie in the eye, eyebrow, nose and mouth regions of the face image; according to the coordinates of the T feature points, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized face image size in the second step:
the vertical distance between the eyes and the eyebrows is Du,1: Du,1 = dvertical||p22, p40||,
the vertical opening distance of the eye is Du,2: Du,2 = dvertical||p45, p47||,
the vertical distance between the eyes and the mouth is Du,3: Du,3 = dvertical||p37, p49||,
the vertical distance between the nose and the mouth is Du,4: Du,4 = dvertical||p34, p52||,
the vertical distance between the upper and lower lips is Du,5: Du,5 = dvertical||p52, p58||,
the horizontal width distance between the two mouth corners is Du,6: Du,6 = dhorizontal||p49, p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In the formula (1), piIs the coordinate set of the ith feature point, pjIs the coordinate set of the jth feature point, pi,yIs the ordinate, p, of the i-th feature pointj,yIs the ordinate of the jth feature point, pi,xIs the abscissa, p, of the ith feature pointj,xIs the abscissa of the jth feature point, dvertical||pi,pjI is the vertical distance between feature points i and j, dhorizontal||pi,pjI | is the horizontal distance between feature points i and j, i ═ 1,2, …,68, j ═ 1,2, …, 68;
setting the first frame of the image frame sequence with normalized face image size from the second step as the neutral frame, whose set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_0,1, D_0,2, D_0,3, D_0,4, D_0,5, D_0,6] (2),
in formula (2), D_0,1, D_0,2, D_0,3, D_0,4, D_0,5 and D_0,6 are the 6 specific distances corresponding to the neutral frame of the image frame sequence with normalized face image size from the second step;
the set V_u of the 6 specific distances of the u-th frame of the image frame sequence with normalized face image size from the second step is shown in formula (3),
V_u = [D_u,1, D_u,2, D_u,3, D_u,4, D_u,5, D_u,6] (3),
in formula (3), u = 1, 2, ..., K-1, where K is the number of face images in the image frame sequence with normalized face image size from the second step, and D_u,1, D_u,2, D_u,3, D_u,4, D_u,5, D_u,6 are the 6 specific distances corresponding to the u-th frame;
the sum of the ratios of the 6 corresponding specific distances of the u-th frame to those of the neutral frame is shown in formula (4),
DF_u = sum_{n=1}^{6} D_u,n / D_0,n (4),
in formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image, n indexes the 6 specific distances, D_0,n represents the n-th specific distance of the neutral frame, and D_u,n represents the n-th specific distance of the u-th frame;
the ratio sum DF_u of each frame of the image frame sequence with normalized face image size from the second step is obtained from formula (2), formula (3) and formula (4), and the u-th frame with the maximum DF_u is screened out as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
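The third-step key-frame screening can be sketched as below. The landmark indices are the patent's 1-based 68-point numbering converted to 0-based array indices, and the frame data here is synthetic rather than output of a real landmark detector:

```python
import numpy as np

# (i, j, axis) index pairs for the six specific distances D_u,1 ... D_u,6;
# patent landmarks p22, p40 etc. are 1-based, so 0-based indices are used.
PAIRS = [(21, 39, 1),   # eyes-eyebrows, vertical
         (44, 46, 1),   # eye opening, vertical
         (36, 48, 1),   # eyes-mouth, vertical
         (33, 51, 1),   # nose-mouth, vertical
         (51, 57, 1),   # upper-lower lip, vertical
         (48, 54, 0)]   # mouth width, horizontal

def distances(pts):
    """Six specific distances for one frame; pts is a (68, 2) (x, y) array."""
    return np.array([abs(pts[j, ax] - pts[i, ax]) for i, j, ax in PAIRS])

def key_frame(frames):
    """Index u of the frame maximising DF_u = sum_n D_u,n / D_0,n."""
    d0 = distances(frames[0])                        # frame 0: neutral frame
    dfs = [(distances(f) / d0).sum() for f in frames[1:]]
    return int(np.argmax(dfs)) + 1

# synthetic landmark sets: the last frame has every distance doubled, so it wins
pts0 = np.arange(136, dtype=float).reshape(68, 2)
frames = [pts0, pts0.copy(), pts0 * 2.0]
print(key_frame(frames))   # 2
```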
fourthly, extracting the texture features of the human face:
extracting the face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the central pixel of each 3 x 3 neighborhood is calculated in each orthogonal plane; the LBP histogram features of the three orthogonal planes are counted; finally, the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector; the LBP operator is calculated as shown in formula (5) and formula (6),
LBP_Z,R = sum_{q=0}^{Z-1} Sig(t_q - t_c) * 2^q (5),
Sig(x) = 1 if x >= 0, and Sig(x) = 0 if x < 0 (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_a,b = sum_{x,y,t} I{LBP_Z,R,b(x, y, t) = a}, a = 0, 1, ..., n_b - 1, b = 0, 1, 2 (7),
in formula (7), b is the index of the plane, with b = 0 corresponding to the XY plane, b = 1 to the XT plane and b = 2 to the YT plane; n_b is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBP_Z,R,b(x, y, t) = a} counts the pixels whose LBP coding value is a when feature extraction with the LBP_Z,R operator is performed in the b-th plane;
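A simplified illustration of the LBP-TOP idea follows. For brevity this sketch computes a basic LBP_{8,1} histogram on just one middle slice per orthogonal plane, whereas the full descriptor accumulates histograms over every slice; the three 2^8-bin histograms of the XY, XT and YT planes are then concatenated into one feature vector:

```python
import numpy as np

def lbp_histogram(plane, Z=8):
    """Basic LBP_{8,1} histogram of one plane slice (interior pixels only)."""
    h, w = plane.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = np.zeros(2 ** Z, dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            tc = plane[y, x]
            code = 0
            for q, (dy, dx) in enumerate(offs):
                if plane[y + dy, x + dx] >= tc:   # Sig(t_q - t_c) = 1
                    code |= 1 << q
            hist[code] += 1
    return hist

def lbp_top(volume):
    """Concatenate LBP histograms of the three orthogonal mid-planes
    of a (T, H, W) frame volume, in the spirit of LBP-TOP."""
    T, H, W = volume.shape
    xy = volume[T // 2]            # spatial plane
    xt = volume[:, H // 2, :]      # horizontal-temporal plane
    yt = volume[:, :, W // 2]      # vertical-temporal plane
    return np.concatenate([lbp_histogram(p) for p in (xy, xt, yt)])

vol = np.arange(6 * 8 * 8, dtype=float).reshape(6, 8, 8)
feats = lbp_top(vol)
print(feats.shape, feats.sum())   # (768,) 84
```

Each histogram sums to the number of interior pixels of its plane (36 + 24 + 24 here), which is a quick sanity check on the implementation.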
fifthly, extracting geometric features of the human face:
according to the key frame of the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame are used to compute the geometric features of the facial expression; in the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, so the geometric feature extraction mainly extracts distance features between the marked points of this T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frame in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs in total; the Euclidean distance between each feature point pair A and B is calculated, giving 50-dimensional Euclidean distance features denoted G_50; the Euclidean distance between feature points A and B is calculated by formula (8),
d(A, B) = sqrt((p_B,x - p_A,x)^2 + (p_B,y - p_A,y)^2) (8),
in formula (8), p_A is the coordinate set of feature point A, p_B is the coordinate set of feature point B, p_A,x is the abscissa of feature point A, p_A,y is the ordinate of feature point A, p_B,x is the abscissa of feature point B, and p_B,y is the ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
selecting, from the T feature points of the key frame obtained by the third-step screening, 10 angles that represent facial feature changes, namely 2 eyebrow angles, 6 eye angles and 2 mouth angles; the angle features are extracted as 10-dimensional angle features denoted Q_10; the angle at a feature point is calculated by formula (9),
theta = arccos( ((p_C - p_D) . (p_E - p_D)) / (|p_C - p_D| * |p_E - p_D|) ) (9),
in formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points forming an angle in the eyebrow, eye or mouth region marked in the third step, where p_D is the vertex coordinate set;
and 5.3, calculating the area characteristics of the face region:
selecting 5 regions of the face image, namely the left and right eyebrows, the two eyes and the mouth, and calculating the area feature of each of the 5 regions; because the sizes of facial organs differ from person to person, the areas of the 5 face regions extracted from the key frame are correspondingly subtracted from the areas of the 5 face regions extracted from the neutral frame to obtain the change features of the face region areas, 5 dimensions in total, denoted O_5; the eyebrow, mouth and eye regions of the face are modelled as triangles, and the area of each triangle is calculated by Heron's formula; the Euclidean distance features G_50 of the face feature point pairs, the angle features Q_10 of the face feature points and the face region area features O_5 are combined into the geometric features F of the face as shown in formula (10),
F = [G_50 Q_10 O_5] (10),
at this point, the face texture features and the face geometric features are concatenated, completing the extraction of the facial expression features;
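The three geometric primitives of the fifth step (formula (8) distances, vertex angles, and Heron-formula triangle areas) can each be written in a few lines; the point coordinates below are toy values, not real landmarks:

```python
import math

def euclid(pA, pB):
    """Euclidean distance between two (x, y) points, as in formula (8)."""
    return math.hypot(pB[0] - pA[0], pB[1] - pA[1])

def vertex_angle(pC, pD, pE):
    """Angle in degrees at vertex D of the triangle C-D-E."""
    a = (pC[0] - pD[0], pC[1] - pD[1])
    b = (pE[0] - pD[0], pE[1] - pD[1])
    cosang = (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def heron_area(p1, p2, p3):
    """Triangle area by Heron's formula, as used for the region features."""
    a, b, c = euclid(p1, p2), euclid(p2, p3), euclid(p3, p1)
    s = (a + b + c) / 2
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

print(euclid((0, 0), (3, 4)))                # 5.0
print(vertex_angle((1, 0), (0, 0), (0, 1)))  # 90.0
print(heron_area((0, 0), (4, 0), (0, 3)))    # 6.0
```

Applying these to the selected landmark pairs, triples and triangles yields the G_50, Q_10 and O_5 feature vectors respectively.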
sixthly, grouping the facial expressions:
the six basic emotions of the human face are: surprise, fear, anger, disgust, happiness and sadness; they are divided pairwise into three groups as follows:
first group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
feeding the facial expression features extracted in the fourth and fifth steps into an ELM classifier for training and testing, thereby completing the first classification of facial expression recognition and obtaining its recognition result; the ELM parameters are set as: ELM type: "classification"; number of hidden layer neurons: "20"; activation function: the "Sigmoid" function;
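A minimal ELM of the kind configured in the seventh step (a random sigmoid hidden layer of 20 neurons, output weights solved by least squares) can be sketched as follows; the training data here is synthetic and the implementation is illustrative, not the patent's exact classifier:

```python
import numpy as np

def elm_train(X, y, n_hidden=20, n_classes=2, seed=0):
    """Random sigmoid hidden layer; output weights by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid activations
    T = np.eye(n_classes)[y]                   # one-hot targets
    beta = np.linalg.pinv(H) @ T               # least-squares output weights
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)

# two trivially separable synthetic classes
X = np.vstack([np.zeros((20, 4)), np.ones((20, 4))])
y = np.array([0] * 20 + [1] * 20)
model = elm_train(X, y)
acc = (elm_predict(X, model) == y).mean()
print(acc)   # 1.0
```

Because the hidden layer is fixed and random, training reduces to a single pseudo-inverse, which is why ELMs train much faster than back-propagated networks.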
and B, taking the speech emotion recognition as a second classification recognition:
in process B, on the basis of the facial expression recognition result of process A and in combination with the voice features, speech emotion feature extraction and the second classification of speech emotion recognition are performed for each of the three groups of the sixth-step facial expression grouping; the specific operations are as follows:
and eighth, extracting the speech emotion characteristics:
for the classification results of the first classification of facial expression recognition in the seventh step, and according to the grouping of the sixth step, different prosodic features are extracted for each group according to the different sensitivities of each group's emotions to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch of the above prosodic features is calculated in the frequency domain;
for the speech signal preprocessed in the second step, the Pitch is calculated by formula (11),
Pitch = argmax_k |DFT(x~(m))(k)|, k = 0, 1, ..., L_M - 1 (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x~(m) represents the speech signal after a Hamming window is applied; x~(m) is calculated as shown in formula (12),
x~(m) = x(m) * [0.54 - 0.46 cos(2 pi m / (N - 1))] (12),
in formula (12), N is the length of the Hamming window and m is the index of the m-th sample within the window;
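A crude frequency-domain pitch estimate in the spirit of formula (11) — Hamming-window the signal, take the DFT, and read off the strongest bin — can be sketched as follows (the function name `pitch_dft` and the simple peak-picking rule are illustrative assumptions, not the patent's exact procedure):

```python
import numpy as np

def pitch_dft(x, fs):
    """Frequency of the largest-magnitude bin of the Hamming-windowed DFT."""
    xw = x * np.hamming(len(x))          # formula (12): apply a Hamming window
    spec = np.abs(np.fft.rfft(xw))
    spec[0] = 0.0                        # ignore the DC bin
    return int(np.argmax(spec)) * fs / len(x)

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)       # 220 Hz test tone
print(pitch_dft(tone, fs))               # 220.0
```

Real pitch trackers refine this with harmonic checks and interpolation between bins, but the strongest-bin rule already recovers the fundamental of a clean tone.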
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) sum_m |sgn{x(m)} - sgn{x(m - 1)}| (13),
in formula (13), ZCR represents the average zero-crossing rate of the N windows, | | is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the function sgn{x(m)} gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1 if x(m) >= 0, and sgn{x(m)} = -1 if x(m) < 0 (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE is calculated as shown in formula (15),
LogE = sum_{m=1}^{N} log( sum x(m)^2 ) (15),
in formula (15), LogE represents the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
psi[X(m)] = [X'(m)]^2 - X(m) X''(m) (16),
in formula (16), psi[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm and X''(m) = d^2 X(m)/dm^2; for a signal of constant amplitude and frequency, X(m) = a cos(Phi m + theta), where a is the signal amplitude, Phi is the signal frequency and theta is the initial phase angle, the operator yields psi[X(m)] = a^2 Phi^2,
for the audio files corresponding to the image frames of each of the three groups in the sixth-step facial expression grouping, the well-known Mel-frequency cepstral coefficients (MFCC) and their first-order and second-order difference features are extracted; finally, the prosodic features extracted for each group are concatenated with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
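The prosodic features of the eighth step can be sketched per analysis window as follows; the discrete form of the Teager operator, psi[x(m)] = x(m)^2 - x(m+1) x(m-1), is used in place of the continuous derivatives of formula (16), and the single long window here stands in for a framed signal:

```python
import numpy as np

def zcr(frame):
    """Average zero-crossing rate of one window, in the spirit of formula (13)."""
    s = np.sign(frame)
    s[s == 0] = 1                       # formula (14): sgn of zero counted as +1
    return 0.5 * np.abs(np.diff(s)).sum() / len(frame)

def log_energy(frame, eps=1e-12):
    """Log of the window's short-time energy."""
    return np.log10(eps + np.sum(frame ** 2))

def teager(frame):
    """Discrete Teager energy operator per sample."""
    return frame[1:-1] ** 2 - frame[2:] * frame[:-2]

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 100 * t)        # 100 Hz test tone
print(round(zcr(x), 3))                # 0.025 (200 crossings / 8000 samples)
print(round(log_energy(x), 2))
print(round(teager(x).mean(), 6))
```

For a pure sinusoid the discrete Teager output is the constant a^2 sin^2(Omega), mirroring the a^2 Phi^2 result quoted for the continuous operator.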
and ninthly, classifying for the second time of the speech emotion recognition:
feeding the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition; the SVM parameters are set as: penalty coefficient: "95"; allowed redundant output: "0"; kernel parameter: "1"; kernel function of the support vector machine: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
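A ninth-step style SVM can be illustrated with scikit-learn's `SVC` as a stand-in; `C=95` and the RBF ("Gaussian") kernel with `gamma=1` mirror the stated penalty coefficient and kernel parameter, the "redundant output" setting has no direct scikit-learn analogue, and the feature data below is synthetic:

```python
# scikit-learn's SVC used as an illustrative stand-in for the patent's SVM
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 8)),    # "class 0" audio features
               rng.normal(4.0, 1.0, (30, 8))])   # "class 1" audio features
y = np.array([0] * 30 + [1] * 30)

# C=95 with an RBF kernel and gamma=1, mirroring the stated settings
clf = SVC(C=95, kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))
```

In practice the mixed audio features of the eighth step would replace the synthetic `X`, with held-out videos forming the test split.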
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
since the speech emotion recognition of process B is a secondary recognition performed on the basis of the facial expression recognition of process A, the relationship between the two recognition rates is a conditional probability relationship, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P(Visual) is the recognition rate of the first-stage facial image recognition, and P(Audio|Visual) is the recognition rate of the second-stage speech emotion recognition;
thereby completing the progressive video emotion recognition based on the decision-level fusion of the two processes of facial expression recognition and speech emotion recognition.
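Formula (17)'s decision-level fusion is a single product of the two stage recognition rates; as a trivial sketch with made-up rates:

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)."""
    return p_visual * p_audio_given_visual

# hypothetical stage accuracies: 90% facial, 85% speech given facial
print(round(fused_recognition_rate(0.90, 0.85), 3))   # 0.765
```

The product form reflects that the second stage only sees samples the first stage has already grouped, so its rate is conditional on the first stage's outcome.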
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811272233.1A CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409296A CN109409296A (en) | 2019-03-01 |
CN109409296B true CN109409296B (en) | 2020-12-01 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731416A (en) * | 2005-08-04 | 2006-02-08 | 上海交通大学 | Method of quick and accurate human face feature point positioning |
CN105139004A (en) * | 2015-09-23 | 2015-12-09 | 河北工业大学 | Face expression identification method based on video sequences |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN107704810A (en) * | 2017-09-14 | 2018-02-16 | 南京理工大学 | A kind of expression recognition method suitable for medical treatment and nursing |
CN108682431A (en) * | 2018-05-09 | 2018-10-19 | 武汉理工大学 | A kind of speech-emotion recognition method in PAD three-dimensionals emotional space |
Non-Patent Citations (3)
Title |
---|
"A new approach of audio emotion recognition";Chien Shing Ooi.et al;《ELSEVIER》;20140324;期刊第5858-5869页 * |
"基于LBP-TOP特征的微表情识别";卢官明等;《南京邮电大学学报(自然科学版)》;20171231;第37卷(第6期);第2章2.7-2.8节 * |
Emotion recognition using facial and audio features;Tarun Krishna.et al;《ICMI "13: Proceedings of the 15th ACM on International conference on multimodal interaction》;20131231;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20201201 |