CN109409296B - Video emotion recognition method integrating facial expression recognition and voice emotion recognition - Google Patents
- Publication number: CN109409296B (application CN201811272233.1A)
- Authority
- CN
- China
- Prior art keywords: recognition, formula, face, features, frame sequence
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06V40/174: Facial expression recognition
- G06V40/168: Feature extraction; face representation
- G06V40/172: Classification, e.g. identification
- G06F18/25: Pattern recognition; fusion techniques
- G10L25/57: Speech or voice analysis specially adapted for processing of video signals
- G10L25/63: Speech or voice analysis specially adapted for estimating an emotional state
Abstract
The invention relates to a video emotion recognition method that integrates facial expression recognition and speech emotion recognition. It concerns the processing of a record carrier for recognizing graphical patterns and is a decision-level audiovisual emotion recognition method with two progressive processes: facial expression recognition and speech emotion recognition are separated within a video, and speech emotion recognition is then performed on the basis of the facial expression recognition result by calculating a conditional probability. The method comprises the following steps: Process A, facial expression recognition as the first classification; Process B, speech emotion recognition as the second classification; and Process C, fusion of facial expression recognition and speech emotion recognition. The invention overcomes the shortcomings of the prior art, which ignores the intrinsic relationship between facial features and speech features in human emotion recognition and suffers from low recognition speed and low recognition rate in video emotion recognition.
Description
Technical Field
The technical solution of the invention relates to the processing of a record carrier for recognizing graphical patterns, and in particular to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology is advancing rapidly, and recognizing human emotion by computer has received extensive attention. How to enable a computer to recognize human emotion more quickly and accurately has therefore become a research hotspot in the field of machine vision.
Human emotions are expressed in various ways, mainly through facial expressions, speech, upper-body posture, and language text. Among these, facial expression and speech are the two most typical modes of emotional expression. Because facial texture and geometric features are easy to extract, emotion recognition based on facial expression can achieve a relatively high recognition rate in the current emotion recognition field. However, for expressions that look similar, such as anger and disgust, or fear and surprise, the texture and geometric features are close to one another, and the recognition rate obtained from facial expression features alone is not high.
Single-modal emotion recognition methods are often limited, so bimodal and multimodal emotion recognition has increasingly become a research hotspot in the emotion recognition field. The key to multimodal emotion recognition lies in the fusion scheme; the mainstream schemes are feature-level fusion and decision-level fusion.
In 2012, in "AVEC 2012: the continuous audio/visual emotion challenge", Schuller et al. cascaded audio and video features into a single feature vector and used support vector regression (SVR) as the baseline of the AVEC 2012 challenge. This feature-level fusion method directly concatenates multimodal features into a combined feature vector. Because the large number of multimodal features easily leads to the curse of dimensionality, the resulting high-dimensional features suffer from data sparsity; considering the interaction between features, the advantage of combining audio and video features through feature-level fusion is limited.
Decision-level fusion first models each emotion-expression mode with its own classifier and then fuses the recognition results of the classifiers; the different modes are combined according to the contribution of each emotional expression, without increasing the feature dimension. In the paper "A combined rule-based and machine learning audio-visual emotion recognition approach", Seng et al. split audiovisual emotion recognition into two mutually independent paths that extract features separately, model each on its own classifier to obtain the corresponding recognition rates, and finally obtain the overall recognition rate through a proportional scoring mechanism and a weight-distribution scheme. Existing decision-level fusion has two main disadvantages. First, the proportional scoring mechanism and the weight-distribution strategy lack a unified authoritative standard, so different researchers often obtain different recognition results on the same research task using different scoring mechanisms and weight distributions. Second, decision-level fusion focuses on fusing the face and speech recognition results while ignoring the intrinsic relationship between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method based on composite spatiotemporal features. It extends the existing volume local binary pattern algorithm to a spatiotemporal ternary pattern, acquires spatiotemporal local ternary pattern moment texture features of facial expression and upper-body posture, further fuses three-dimensional gradient-orientation histogram features to enhance the description of the emotion video, and combines the two kinds of features into a composite spatiotemporal feature.
CN105512609A discloses a multimodal-fusion video emotion recognition method based on a kernel extreme learning machine. It performs feature extraction and feature selection on the image and audio information of a video to obtain video features; preprocesses the collected multi-channel EEG signals and performs feature extraction and selection on them to obtain EEG features; establishes a multimodal-fusion video emotion recognition model based on the kernel extreme learning machine; and inputs the video features and EEG features into that model to perform video emotion recognition and obtain the final classification accuracy. However, the algorithm achieves a high classification rate only on three classes of video emotion data, which limits its usability.
CN103400145B discloses an audio-visual fusion emotion recognition method based on a cue neural network. It first trains an independent neural network on the feature data of each of three channels (frontal facial expression, profile facial expression, and speech) to recognize discrete emotion categories; during training, 4 cue nodes are added to the output layer of each neural network model to carry the cue information of 4 coarse-grained categories in the activation-evaluation space, and a multimodal fusion model, itself a neural network trained on the cue information, then fuses the outputs of the three networks. However, in most videos the number of profile-expression frames is small and hard to collect effectively, which greatly limits the method in practice. The method also involves neural network training and fusion; as the data volume and data dimension grow, the consumption of training time and resources gradually increases, and the error rate rises as well.
CN105138991B discloses a video emotion recognition method based on the fusion of emotion-significant features. It extracts audio features and visual emotion features from each video shot of a training video set; the audio features form an emotion-distribution histogram feature based on a bag-of-words model, the visual emotion features form an emotion-attention feature based on a visual dictionary, and the two are fused top-down into video features with emotion significance. However, when extracting the visual emotion features the method uses only the video key frames, ignoring to some extent the relationships between features of successive video frames.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video emotion recognition method that fuses facial expression recognition and speech emotion recognition, namely a decision-level audiovisual emotion recognition method with two progressive processes.
The technical solution adopted by the invention to solve this problem is as follows: a video emotion recognition method fusing facial expression recognition and speech emotion recognition, which is a decision-level audiovisual emotion recognition method with two progressive processes, comprising the following steps:
Process A. Facial expression recognition as the first classification:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
using the published Viola-Jones algorithm, locate and crop the face in the image frame sequence obtained in the first step, and normalize each cropped face image to M x M pixels, obtaining an image frame sequence with normalized face-image size;
perform voice activity detection on the voice signal obtained in the first step with the known voice endpoint detection algorithm VAD and remove noise and silence segments, obtaining a voice signal whose features are easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
mark T facial feature points (T = 68) on the image frame sequence with normalized face-image size from the second step. The positions of the 68 feature points are known, and the marked feature points outline the eye, eyebrow, nose, and mouth regions of the face image. For the u-th frame image of the sequence, calculate the following 6 specific distances from the coordinates of the T feature points:
the distance between the eyes and the eyebrows in the vertical direction is Du,1:Du,1=dvertical||p22,p40||,
The distance in the vertical direction of the opening of the eye is Du,2:Du,2=dvertical||p45,p47||,
The distance between the eyes and the mouth in the vertical direction is Du,3:Du,3=dvertical||p37,p49||,
The distance D in the vertical direction between the nose and the mouthu,4:Du,4=dvertical||p34,p52||,
The distance in the vertical direction of the upper and lower lips is Du,5:Du,5=dvertical||p52,p58||,
The two sides of the mouth have a width distance D in the horizontal directionu,6:Du,6=dhorizontal||p49,p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In the formula (1), piIs the coordinate set of the ith feature point, pjIs the coordinate set of the jth feature point, pi,yIs the ordinate, p, of the i-th feature pointj,yIs the ordinate of the jth feature point, pi,xIs the abscissa, p, of the ith feature pointj,xIs the abscissa of the jth feature point, dvertical||pi,pjI is the vertical distance between feature points i and j, dhorizontal||pi,pjI | is the horizontal distance between feature points i and j, i ═ 1,2, …,68, j ═ 1,2, …, 68;
set the first frame of the image frame sequence with normalized face-image size from the second step as the neutral frame; its set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame in the image frame sequence with normalized face-image size from the second step;
the set V_u of 6 specific distances of the u-th frame in the image frame sequence with normalized face-image size from the second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
In formula (3), u = 1, 2, ..., K - 1, where K is the number of face images in the group of image frame sequences with normalized face-image size from the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances of the u-th frame;
the sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized face-image size from the second step is shown in formula (4),
DF_u = Σ_{n=1}^{6} D_{u,n} / D_{0,n}   (4),
In formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame;
for the image frame sequence with normalized face-image size from the second step, obtain the specific-distance ratio sum DF_u of each frame according to formulas (2), (3) and (4), and select the u-th frame with the maximum DF_u as the key frame of the image frame sequence.
This completes the marking of the facial feature points and the screening of the key frame in the image frame sequence;
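The key-frame screening above can be sketched as follows; the distance values are hypothetical stand-ins for measurements computed from the 68 marked feature points:

```python
# Key-frame screening (third step): for each frame u, compute DF_u, the sum of
# the ratios of its 6 specific distances to those of the neutral frame (frame 0),
# per formula (4), and keep the frame with the maximum DF_u as the key frame.

def screen_key_frame(distance_sets):
    """distance_sets[u] is the list of 6 specific distances V_u; index 0 is the neutral frame."""
    v0 = distance_sets[0]
    best_u, best_df = 1, float("-inf")
    for u in range(1, len(distance_sets)):
        df_u = sum(d_un / d_0n for d_un, d_0n in zip(distance_sets[u], v0))  # formula (4)
        if df_u > best_df:
            best_u, best_df = u, df_u
    return best_u, best_df

frames = [
    [10.0, 8.0, 40.0, 20.0, 5.0, 30.0],   # neutral frame V_0 (hypothetical values)
    [10.5, 8.2, 40.5, 20.2, 6.0, 31.0],   # frame 1: slight deviation
    [12.0, 9.5, 42.0, 21.0, 9.0, 36.0],   # frame 2: expression peak
]
key, df = screen_key_frame(frames)
print(key)  # frame 2 deviates most from the neutral frame
```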
fourthly, extracting the texture features of the human face:
extract face texture features with the LBP-TOP algorithm. First, divide the image frame sequence with normalized face-image size from the second step into the XY, XT and YT orthogonal planes in space-time, compute the LBP value of the central pixel of each 3 x 3 neighborhood in every orthogonal plane, accumulate the LBP histogram of each of the three orthogonal planes, and finally concatenate the three LBP histograms into an overall feature vector. The LBP operator is computed as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q   (5),
Sig(t_q - t_c) = 1 if t_q - t_c ≥ 0, and 0 otherwise   (6),
In formula (5) and formula (6), Z is the number of neighborhood points around the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point.
The LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a},  a = 0, ..., n_b - 1,  b = 0, 1, 2   (7),
In formula (7), b is the plane index (b = 0 for the XY plane, b = 1 for the XT plane, b = 2 for the YT plane), n_b is the number of binary patterns produced by the LBP operator on the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP coding value equals a when the LBP_{Z,R} operator performs feature extraction on the b-th plane;
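As an illustration of the LBP operator of formulas (5) and (6), the code of the central pixel of one 3 x 3 neighborhood can be computed as below; the plane handling and histogram accumulation of the full LBP-TOP are omitted, and the pixel values are invented:

```python
# Basic LBP operator (formulas (5), (6)) on a 3x3 neighborhood: threshold the
# 8 neighbors against the central pixel and weight the results by powers of 2.

def lbp_code(neigh):
    """neigh is a 3x3 list of pixel values; returns the LBP code of the center."""
    tc = neigh[1][1]
    # the 8 neighbors in a fixed circular order starting at the top-left corner
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for q, (r, c) in enumerate(order):
        sig = 1 if neigh[r][c] >= tc else 0   # Sig(t_q - t_c), formula (6)
        code += sig * (2 ** q)                # weighted sum, formula (5)
    return code

patch = [
    [52, 60, 58],
    [49, 50, 55],
    [40, 44, 51],
]
print(lbp_code(patch))  # -> 31
```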
fifthly, extracting geometric features of the human face:
using the key frame of the screened image frame sequence obtained in the third step, compute the geometric features of the facial expression from the coordinates of the T feature points marked in the key frame. In the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, consisting mainly of the eyebrow, eye, nose, chin and mouth regions, so the geometric feature extraction method mainly extracts distance features between the marked points of this T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
from the T feature points of the key frame obtained in the third step, select 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs in total. Compute the Euclidean distance between each feature-point pair A and B, giving a 50-dimensional Euclidean distance feature denoted G_50. The Euclidean distance between feature points A and B is computed by formula (8),
d(A, B) = sqrt((p_{A,x} - p_{B,x})² + (p_{A,y} - p_{B,y})²)   (8),
In formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} and p_{A,y} are the abscissa and ordinate of feature point A, and p_{B,x} and p_{B,y} are the abscissa and ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
from the T feature points of the key frame screened in the third step, select 10 angles representing facial feature changes: 2 eyebrow angles, 6 eye angles and 2 mouth angles. Extract the angle features, 10 dimensions in total, denoted Q_10. The angle at a feature point is computed by formula (9),
θ = arccos( ((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||) )   (9),
In formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming the angle in the eyebrow, eye or mouth region marked in the third step, with p_D the vertex;
and 5.3, calculating the area characteristics of the face region:
select 5 regions of the face image: the left and right eyebrows, the two eyes, and the mouth. Compute the area of each of the 5 regions, and subtract the areas of the 5 face regions extracted from the neutral frame from the corresponding areas extracted from the key frame; because the sizes of facial organs differ from person to person, this yields the change features of the face-region areas, 5 dimensions in total, denoted O_5. The eyebrow, mouth and eye regions of the face are approximated as triangles, and the area of each triangle is computed with Heron's formula. The Euclidean distance features G_50 of the feature-point pairs, the angle features Q_10 of the feature points and the face-region area features O_5 are combined into the facial geometric feature F as shown in formula (10),
F = [G_50  Q_10  O_5]   (10),
The facial texture features and the facial geometric features are then concatenated, which completes the extraction of the facial expression features;
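The three geometric measurements of the fifth step, the pairwise Euclidean distance of formula (8), the vertex angle of formula (9), and the triangle area via Heron's formula, reduce to elementary plane geometry. A compact sketch with hypothetical feature-point coordinates:

```python
import math

# Geometric feature primitives of the fifth step: Euclidean distance between a
# feature-point pair (formula (8)), angle at a vertex feature point
# (formula (9)), and triangle area via Heron's formula for the region features.

def euclid(pa, pb):                        # formula (8)
    return math.hypot(pa[0] - pb[0], pa[1] - pb[1])

def angle(pc, pd, pe):                     # formula (9), vertex at pd
    v1 = (pc[0] - pd[0], pc[1] - pd[1])
    v2 = (pe[0] - pd[0], pe[1] - pd[1])
    cosang = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def heron_area(pa, pb, pc):                # triangle area for the region features
    a, b, c = euclid(pb, pc), euclid(pa, pc), euclid(pa, pb)
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

# Hypothetical coordinates, standing in for marked feature points:
print(euclid((0, 0), (3, 4)))              # -> 5.0
print(angle((1, 0), (0, 0), (0, 1)))       # -> 90.0 degrees
print(heron_area((0, 0), (4, 0), (0, 3)))  # -> 6.0
```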
sixthly, grouping the facial expressions:
the six facial emotions are: surprise, fear, anger, disgust, happiness and sadness. They are divided pairwise into the following three groups:
First group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
put the facial expression features extracted in the fourth and fifth steps into an ELM classifier for training and testing, completing the first classification of facial expression recognition and obtaining its recognition result, with the ELM parameters set as follows: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: "Sigmoid";
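An ELM trains in closed form: the hidden layer is random and fixed, and only the output weights are solved with a pseudo-inverse. A minimal sketch using the parameters stated above (20 sigmoid hidden neurons); the toy two-class data are invented stand-ins for the expression features:

```python
import numpy as np

# Minimal Extreme Learning Machine (ELM) classifier matching the seventh-step
# parameters: 20 hidden neurons, Sigmoid activation. The hidden weights are
# random and never trained; the output weights come from a Moore-Penrose
# pseudo-inverse.

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=20, n_classes=2):
    W = rng.normal(size=(X.shape[1], n_hidden))   # random input weights
    b = rng.normal(size=n_hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # Sigmoid hidden-layer output
    T = np.eye(n_classes)[y]                      # one-hot targets
    beta = np.linalg.pinv(H) @ T                  # closed-form output weights
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)

# Toy two-class data (invented), standing in for the expression features:
X = np.vstack([rng.normal(-2, 0.5, (30, 4)), rng.normal(2, 0.5, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
model = elm_train(X, y)
acc = np.mean(elm_predict(X, model) == y)
print(acc)
```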
Process B. Speech emotion recognition as the second classification:
in process B, based on the facial expression recognition result of process A, speech features are combined, and speech emotion feature extraction and the second classification of speech emotion recognition are carried out for each of the three groups of the sixth-step facial expression grouping, as follows:
eighthly, extracting the speech emotion features:
for the result of the first classification of facial expression recognition in the seventh step, and according to the grouping of the sixth step, extract different prosodic features for each group according to the sensitivity of each group's emotions to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch of the prosodic features above is calculated in the frequency domain. For the speech signal preprocessed in the second step, the Pitch is calculated by formula (11),
Pitch = argmax_k |DFT(x̃(m), L_M)|   (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x̃(m) represents the speech signal multiplied by a Hamming window; x̃(m) is calculated as shown in formula (12),
x̃(m) = x(m) · [0.54 - 0.46 cos(2πm / (N - 1))],  0 ≤ m ≤ N - 1   (12),
In formula (12), N is the length of the Hamming window and m is the m-th sample point;
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) Σ_{m=1}^{N} |sgn{x(m)} - sgn{x(m-1)}|   (13),
In formula (13), ZCR represents the average zero-crossing rate of the N windows, | · | is the absolute-value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the sgn{x(m)} function gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1, x(m) ≥ 0;  sgn{x(m)} = -1, x(m) < 0   (14),
In formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE above is calculated as in formula (15),
LogE = log( Σ_{m=1}^{N} x²(m) )   (15),
In formula (15), LogE represents the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m)·X″(m) (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm, and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = A cos(Φm + θ), where A is the signal amplitude, Φ is the signal frequency, and θ is the initial phase angle of the signal,
extracting the well-known Mel frequency cepstrum coefficients (MFCC), together with their first-order and second-order difference features, from the audio files corresponding to the image frames of each of the three groups of the sixth-step facial expression grouping, and finally concatenating each group's extracted prosodic features with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as follows: penalty coefficient: "95", allowed redundant outputs: "0", kernel parameter: "1", support vector machine kernel function: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
since the speech emotion recognition in process B is a secondary recognition performed on the basis of the facial emotion recognition in process A, the relationship between the two recognition rates is a conditional probability, and the final recognition rate P(Audio_Visual) is calculated by formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P(Visual) is the recognition rate of the first classification on the face images, and P(Audio|Visual) is the recognition rate of the second classification on speech emotion;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
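The decision-level fusion of formula (17) is direct to compute; a one-line Python sketch (the example values are illustrative, not from the patent):

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    # Formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)
    return p_visual * p_audio_given_visual
```

For example, a first-stage grouping accuracy of 0.90 combined with a second-stage within-group accuracy of 0.85 yields a fused rate of 0.765.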
In the video emotion recognition method combining facial expression recognition and speech emotion recognition, in the third step, the coordinates of the T feature points are determined, where T is 68.
In the video emotion recognition method combining facial expression recognition and voice emotion recognition, the voice endpoint detection algorithm is called Voice Activity Detection in English, abbreviated as VAD; the English of zero-crossing rate is Zero-Crossing Rate, abbreviated as ZCR; the English of logarithmic energy is LogEnergy, abbreviated as LogE; the English of Mel frequency cepstral coefficients is Mel-frequency cepstral coefficients, abbreviated as MFCC; the English of the Teager energy operator is Teager Energy Operator, abbreviated as TEO. The speech endpoint detection algorithm, zero-crossing rate, logarithmic energy, Mel frequency cepstrum coefficients and the Teager energy operator are all well known in the technical field.
The above video emotion recognition method combining facial expression recognition and speech emotion recognition is a calculation operation method that can be grasped by those skilled in the art.
The invention has the beneficial effects that: compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) The invention provides a video emotion recognition method fusing facial expression recognition and voice emotion recognition, an audiovisual emotion recognition method at the decision level. The method separates the facial expression recognition and the voice emotion recognition in the video, adopts a two-process progressive emotion recognition method, and uses a conditional probability calculation. Because the speech emotion recognition is performed on the basis of the facial expression recognition, the influence of the facial expression recognition result on the speech emotion recognition is fully considered; the two recognitions are fused more closely and assist each other, so a more ideal human emotion recognition effect is obtained. This overcomes the defects of the prior art, in which the internal relation between face features and voice features is ignored in human emotion recognition, and video emotion recognition has low recognition speed and low recognition rate.
(2) Chien et al. performed an "audio feature analysis" experiment in a "new approach of audio observation recognition" in 2014, demonstrating that the 6 emotions have different degrees of sensitivity to the prosodic features Pitch, Zero-Crossing Rate, LogEnergy and Teager Energy Operator. That paper classifies Mel-Frequency Cepstral Coefficients (MFCC) extracted from audio with an SVM classifier; the recognition rate decreases progressively from two-class to four-class to six-class classification, i.e. the fewer the classes, the better the classifier performs. Therefore, in the invention, three classes are used in the first facial expression classification and two classes in the second audio classification. The method thereby simplifies the multi-class problem into a three-class problem and a two-class problem, reducing the feature dimension, shortening the training time, and greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the invention extracts not only the face features in the video but also the audio features; this bimodal combination of face and audio features helps identify the emotion of the person in the video more accurately.
(4) Compared with CN105512609A, the method provided in CN105512609A can only identify three emotions in a video, but the method can identify six emotions in the video, and the average identification rate of the method is 9.92% higher than that of the video in CN 105512609A.
(5) Compared with CN105138991A, the method of the invention classifies the face features and the voice features separately, avoiding the "curse of dimensionality" easily caused by feature-level fusion; the decision-level fusion method is simple to operate, and training and recognition are faster.
(6) When the audio features are extracted, the different sensibilities of different emotions to different audio features are considered, so that different audio features are extracted from each group, and the second classification based on the voice features is facilitated.
(7) The method extracts texture, geometry, time and rhythm characteristics, different characteristics reflect different characteristics of expressions, and a classifier can be well trained to perform video emotion recognition from multiple modes.
(8) The invention applies a two-stage progressive emotion classification method, with face recognition as the primary modality and voice recognition as the auxiliary modality; the two complement and assist each other, achieving more accurate video emotion recognition.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a block diagram showing the flow of the method of the present invention.
Fig. 2 is a schematic diagram of the labeling of 6 specific distances and 68 feature points of a human face.
Fig. 3 is a diagram of an example of 68 feature point labels for a face in the eNTERFACE' 05 database.
Detailed Description
The embodiment shown in fig. 1 shows that the process of the method of the present invention comprises process A, process B and process C.
The embodiment shown in fig. 2 is an exemplary image labeled with the feature points; it shows the labels of the 68 feature points of the human face and the 6 specific distances: the vertical distance between feature points 22 and 40 is denoted Du,1, the vertical distance between feature points 45 and 47 is denoted Du,2, the vertical distance between feature points 37 and 49 is denoted Du,3, the vertical distance between feature points 34 and 52 is denoted Du,4, the vertical distance between feature points 52 and 58 is denoted Du,5, and the horizontal distance between feature points 49 and 55 is denoted Du,6. The connecting lines between the feature points outline the eyebrow, eye and mouth regions of the human face.
The embodiment shown in fig. 3 shows an example of labeling a face in the eNTERFACE' 05 database with Dlib feature points, where 68 feature points labeled in the figure correspond to the labels of the 68 feature points shown in the schematic diagram of labeling the face feature points in fig. 2.
Example 1
The video emotion recognition method fusing facial expression recognition and voice emotion recognition in the embodiment is a decision-level-based two-process progressive audiovisual emotion recognition method, and specifically comprises the following steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
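The patent relies on a known VAD algorithm for the speech preprocessing; as a rough stand-in (not the patent's method; the frame length and threshold ratio below are arbitrary assumptions), a short-time-energy gate can drop silent frames:

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold_ratio=0.1):
    """Minimal energy-threshold voice activity stand-in: keep frames whose
    short-time energy exceeds a fraction of the peak frame energy."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)          # short-time energy per frame
    keep = energy > threshold_ratio * energy.max()
    return frames[keep].ravel()                 # concatenate the voiced frames
```

Production systems would instead use a proper VAD with noise estimation and hangover smoothing; this sketch only illustrates the silence-removal step.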
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T feature points on the image frame sequence with normalized face image size from the second step, where T = 68; the positions of the 68 feature points are known, and the marked feature point outlines lie in the eye, eyebrow, nose and mouth regions of the face image. In this embodiment, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized face image size in the second step, according to the coordinates of the T = 68 feature points:
the vertical distance between the eyes and the eyebrows is Du,1: Du,1 = dvertical||p22, p40||,
the vertical opening distance of the eye is Du,2: Du,2 = dvertical||p45, p47||,
the vertical distance between the eyes and the mouth is Du,3: Du,3 = dvertical||p37, p49||,
the vertical distance between the nose and the mouth is Du,4: Du,4 = dvertical||p34, p52||,
the vertical distance between the upper and lower lips is Du,5: Du,5 = dvertical||p52, p58||,
the horizontal width distance between the two mouth corners is Du,6: Du,6 = dhorizontal||p49, p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In formula (1), pi is the coordinate pair of the i-th feature point, pj is the coordinate pair of the j-th feature point, pi,y is the ordinate of the i-th feature point, pj,y is the ordinate of the j-th feature point, pi,x is the abscissa of the i-th feature point, pj,x is the abscissa of the j-th feature point, dvertical||pi, pj|| is the vertical distance between feature points i and j, dhorizontal||pi, pj|| is the horizontal distance between feature points i and j, i = 1, 2, …, 68, j = 1, 2, …, 68;
setting the first frame in the image frame sequence with normalized human face image size in the second step as a neutral frame, and the set V of 6 specific distances0As shown in the formula (2),
V0=[D0,1,D0,2,D0,3,D0,4,D0,5,D0,6] (2),
in formula (2), D0,1, D0,2, D0,3, D0,4, D0,5 and D0,6 are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized face image size in the second step;
the set V of 6 specific distances of the u frame in the image frame sequence with normalized face image size in the second stepuAs shown in the formula (3),
Vu=[Du,1,Du,2,Du,3,Du,4,Du,5,Du,6] (3),
in formula (3), u = 1, 2, …, K−1, where K is the number of face images in the group of image frame sequences with normalized face image size in the second step, and Du,1, Du,2, Du,3, Du,4, Du,5, Du,6 are the 6 specific distances corresponding to the u-th frame in the image frame sequence with normalized face image size in the second step;
the sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized face image size in the second step is shown in formula (4),
DFu = Σn=1…6 (Du,n / D0,n) (4),
in formula (4), DFu represents the sum of the ratios of the 6 specific distances between the neutral frame image and the u-th frame image in the image frame sequence with normalized face image size in the second step, n indexes the 6 specific distances, D0,n represents the n-th specific distance corresponding to the neutral frame in the image frame sequence with normalized face image size in the second step, and Du,n represents the n-th specific distance corresponding to the u-th frame in the image frame sequence with normalized face image size in the second step;
in the image frame sequence with normalized face image size in the second step, the ratio sum DFu corresponding to each image frame is obtained according to formula (2), formula (3) and formula (4); the u-th frame image with the maximum DFu is screened out as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
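Assuming formula (4) is the sum of ratios Du,n/D0,n as described, key-frame screening reduces to an argmax over frames; a numpy sketch:

```python
import numpy as np

def key_frame_index(distances):
    """distances: (K, 6) array; row 0 holds the neutral-frame distances V0
    (formula (2)), rows u >= 1 hold Vu (formula (3)). Returns the index u
    maximising DFu, the sum of ratios D_{u,n}/D_{0,n} (formula (4))."""
    d0 = distances[0]
    df = (distances[1:] / d0).sum(axis=1)   # DFu for u = 1 .. K-1
    return int(np.argmax(df)) + 1           # +1: df[0] corresponds to frame 1
```

The frame whose eye, eyebrow and mouth distances deviate most from the neutral frame (largest DFu) is taken as the key frame.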
fourthly, extracting the texture features of the human face:
extracting face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the central pixel of each 3×3 neighborhood is calculated in each orthogonal plane, the LBP histogram features of the three orthogonal planes are counted, and finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z−1} Sig(t_q − t_c) · 2^q (5),
Sig(t_q − t_c) = 1, t_q − t_c ≥ 0; Sig(t_q − t_c) = 0, t_q − t_c < 0 (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q − t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a}, a = 0, …, n_b − 1, b = 0, 1, 2 (7),
in formula (7), b is the number of the plane (b = 0 for the XY plane, b = 1 for the XT plane, b = 2 for the YT plane), n_b is the number of binary patterns produced by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the number of pixel points whose LBP coding value is a when feature extraction is performed with the LBP_{Z,R} operator in the b-th plane;
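A small sketch of the basic 3×3 LBP operator of formulas (5)-(6) and the per-plane histogram that forms one term of the LBP-TOP concatenation in formula (7); the neighbour ordering below is an arbitrary convention:

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the centre pixel of a 3x3 patch (Z = 8, R = 1):
    threshold the neighbours against the centre, weight by 2^q."""
    center = patch[1, 1]
    # Eight neighbours, ordered clockwise from the top-left corner
    neighbours = patch[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
    sig = (neighbours >= center).astype(int)          # formula (6)
    return int((sig * 2 ** np.arange(8)).sum())       # formula (5)

def lbp_histogram(plane):
    """Histogram of LBP codes over one orthogonal plane (XY, XT or YT)."""
    h = np.zeros(256, dtype=int)
    H, W = plane.shape
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            h[lbp_code(plane[y - 1:y + 2, x - 1:x + 2])] += 1
    return h
```

LBP-TOP then concatenates the three per-plane histograms into one feature vector.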
fifthly, extracting geometric features of the human face:
according to the key frames in the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame are used to obtain the geometric features of the facial expression; in the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, with the specific feature points shown in Table 2, so the extraction method of the geometric features of the face mainly extracts distance features between the marked points of the facial T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frames in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs of feature points in total; the Euclidean distance between the feature points of each pair A and B is calculated, giving 50-dimensional Euclidean distance features, denoted G50. Formula (8) for calculating the Euclidean distance between feature points A and B is as follows,
d||pA, pB|| = √((pB,x − pA,x)² + (pB,y − pA,y)²) (8),
in formula (8), pA is the coordinate pair of feature point A, pB is the coordinate pair of feature point B, pA,x is the abscissa of feature point A, pA,y is the ordinate of feature point A, pB,x is the abscissa of feature point B, and pB,y is the ordinate of feature point B;
table 1 shows the pairs of face feature points to be calculated in the face T-shaped region, where d | | pA,pB| | represents the euclidean distance between pairs of feature points A, B;
TABLE 1
And 5.2, calculating the angle characteristics of the face characteristic points:
selecting 10 angles representing facial feature changes from the 68 feature points of the key frame screened in the third step, among them 2 eyebrow angles, 6 eye angles and 2 mouth angles, and extracting the angle features, 10 dimensions in total, denoted Q10; the specific angles are shown in Table 2, and formula (9) for calculating the angle of a feature point is as follows,
Q(pC, pD, pE) = arccos( ((pC − pD)·(pE − pD)) / (||pC − pD|| · ||pE − pD||) ) (9),
in formula (9), pC, pD and pE are the coordinate pairs of the three feature points forming the angle in the eyebrow, eye or mouth region marked in the third step, with pD the vertex coordinate pair;
Table 2 shows the face feature point angles to be calculated in the facial T-shaped region, where Q(pC, pD, pE) represents the angle feature at vertex D;
TABLE 2
And 5.3, calculating the area characteristics of the face region:
selecting 5 areas of the face image, including left and right eyebrows, two eyes and a mouth, and respectively calculating the area characteristics of the 5 areas, wherein the specific area areas are shown in a table 3;
TABLE 3
Table 3 shows the areas of the regions enclosed by the face feature points to be calculated in the facial T-shaped region, where O(pA, pB, pC, pD) represents the area of the region enclosed by the lines connecting feature points A, B, C and D;
because the sizes of facial organs differ from person to person, the areas of the 5 face regions of Table 3 extracted from the key frame are correspondingly subtracted from the areas of the same 5 regions extracted from the neutral frame, giving the change features of the face region areas, 5 dimensions in total, denoted O5; the eyebrow, mouth and eye regions of the face are treated as triangles, and the area of each triangle is calculated by Heron's formula. The Euclidean distance features G50 of the feature point pairs, the angle features Q10 of the feature points and the area features O5 of the face regions are combined into the geometric feature F of the face, as shown in formula (10),
F=[G50 Q10 O5] (10),
at this point, the facial texture features and the facial geometric features are connected in series to complete the extraction of the facial expression features;
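The three geometric feature families combined into F in formula (10) can be sketched as follows; the arccos form of the angle and the use of Heron's formula for the triangular regions follow the description above, while the exact point selections remain those of Tables 1-3:

```python
import numpy as np

def euclid(pa, pb):
    # Formula (8): Euclidean distance between feature points A and B
    return float(np.hypot(pb[0] - pa[0], pb[1] - pa[1]))

def angle_at(pc, pd, pe):
    # Formula (9) as reconstructed: angle (degrees) at vertex D
    # between the rays D->C and D->E
    v1, v2 = np.subtract(pc, pd), np.subtract(pe, pd)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def heron_area(pa, pb, pc):
    # Heron's formula for a triangular face region (eyebrow, eye, mouth)
    a, b, c = euclid(pb, pc), euclid(pa, pc), euclid(pa, pb)
    s = (a + b + c) / 2
    return float(np.sqrt(s * (s - a) * (s - b) * (s - c)))
```

Evaluating these over the 50 point pairs, 10 angles and 5 regions, and concatenating the results, yields the 65-dimensional geometric feature F = [G50 Q10 O5] of formula (10).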
sixthly, grouping the facial expressions:
the six facial emotions are: surprise, fear, anger, disgust, happiness and sadness; they are divided pairwise into three groups, as follows:
first group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
putting the facial expression features extracted in the fourth step and the fifth step into an ELM classifier for training and testing, thereby finishing the first classification of facial expression recognition and obtaining the recognition result of the first classification of the facial expression recognition, wherein the parameters of the ELM are set as: ELM type: "classification", number of hidden layer neurons: "20", activation function: a "Sigmoid" function;
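The patent specifies an ELM with 20 Sigmoid hidden neurons. A minimal numpy sketch of extreme learning machine training (random hidden-layer weights, output weights by Moore-Penrose pseudo-inverse); the function names and toy data are illustrative, not the patent's implementation:

```python
import numpy as np

def train_elm(X, y, n_hidden=20, seed=0):
    """Minimal ELM: random input weights and biases, Sigmoid hidden layer,
    output weights solved in closed form with the pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # Sigmoid hidden activations
    T = np.eye(int(y.max()) + 1)[y]          # one-hot class targets
    beta = np.linalg.pinv(H) @ T             # least-squares output weights
    return W, b, beta

def predict_elm(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)
```

Unlike backpropagation-trained networks, only the output weights beta are learned, which keeps training fast; this is the property the patent exploits for the first-stage three-class grouping.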
and B, taking the speech emotion recognition as a second classification recognition:
in the process B, based on the facial expression recognition result in the process a, the speech features are combined, and the speech emotion feature extraction and the second classification of the speech emotion recognition are performed on each of the three groups in the sixth facial expression grouping, specifically, the operations are as follows:
and eighth, extracting the speech emotion characteristics:
and aiming at the classification result of the first classification of the facial expression recognition in the seventh step, according to the grouping in the sixth step, different prosodic features are respectively extracted according to the different sensitivity degrees of the emotions of each group to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch height Pitch of the prosodic feature described above is calculated in the frequency domain,
for the speech signal M preprocessed in the second step above, the pitch Pitch is calculated by formula (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x̃(m) represents the speech signal after a Hamming window is applied; x̃(m) is calculated as shown in formula (12),
in formula (12), N is the number of Hamming windows and m indexes the m-th Hamming window;
the zero crossing rate ZCR in the prosodic feature is calculated as shown in equation (13),
in formula (13), ZCR represents the average zero-crossing rate of the N windows, |·| is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the sgn{x(m)} function gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1, x(m) ≥ 0; sgn{x(m)} = −1, x(m) < 0 (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the above log energy LogE is calculated as in equation (15),
in formula (15), LogE represents the total logarithmic energy of N windows, x (m) is the speech signal of the mth window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m)·X″(m) (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm, and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = A cos(Φm + θ), where A is the signal amplitude, Φ is the signal frequency, and θ is the initial phase angle of the signal,
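In discrete form the Teager energy operator of formula (16) is usually written ψ[x(m)] = x(m)² − x(m+1)·x(m−1); a numpy sketch of it, together with a per-window log energy in the spirit of formula (15) (the small ε guard against log 0 is an addition of this sketch):

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator:
    psi[x(m)] = x(m)^2 - x(m+1) * x(m-1), for m = 1 .. len(x)-2."""
    return x[1:-1] ** 2 - x[2:] * x[:-2]

def log_energy(frames, eps=1e-12):
    """Total log energy over N windows (in the spirit of formula (15)).
    frames: (N, L) array of framed, windowed speech."""
    return float(np.log((frames ** 2).sum(axis=1) + eps).sum())
```

For a pure tone A·cos(Ωm), the discrete TEO is the constant A²·sin²Ω, which is why it tracks both amplitude and frequency of the speech signal.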
extracting the well-known Mel frequency cepstrum coefficients (MFCC), together with their first-order and second-order difference features, from the audio files corresponding to the image frames of each of the three groups of the sixth-step facial expression grouping, and finally concatenating each group's extracted prosodic features with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
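The MFCC first- and second-order difference features are obtained by differencing the coefficient matrix along time; a minimal finite-difference stand-in (real MFCC toolchains typically use a regression window instead of a plain diff):

```python
import numpy as np

def delta(features):
    """First-order difference along the time axis, padded so the output
    keeps the input shape; applying it twice gives the second-order
    difference features."""
    d = np.diff(features, axis=0)
    return np.vstack([d[:1], d])  # repeat first row to preserve the shape
```

The mixed audio feature of a group is then the concatenation of its prosodic features with the MFCC matrix, delta(mfcc) and delta(delta(mfcc)).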
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as follows: penalty coefficient: "95", allowed redundant outputs: "0", kernel parameter: "1", support vector machine kernel function: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
because the speech emotion recognition is a secondary recognition performed on the basis of the facial emotion recognition, the relationship between the two recognition rates is a conditional probability; the final recognition rate P(Audio_Visual) is calculated by formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P (Visual) is the recognition rate of the first face image recognition, and P (Audio | Visual) is the recognition rate of the second speech emotion;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
In this example, comparison experiments with the prior art were carried out on the eNTERFACE'05 and RML databases; the specific recognition rates are shown in Table 4 below:
TABLE 4
The experimental results in Table 4 compare the recognition rates of audiovisual emotion recognition systems of recent years on the eNTERFACE'05 and RML databases: the average recognition rate of audiovisual emotion recognition on the eNTERFACE'05 database in the "Audio recording recognition method and multi-classifier neural networks" document by Mahdi Bejani et al., 2014, was 77.78%;
the average recognition rate of audiovisual emotion recognition on the RML database in the "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition" document by Shiqing Zhang et al., 2016, was 74.32%;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition" document by Shiqing Zhang et al., 2017, were 85.97% and 80.36%, respectively;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach" document by Yaxiong Ma et al., 2018, were 84.56% and 81.98%, respectively. The decision-level-based two-process progressive audiovisual emotion recognition method adopted in this embodiment achieves a relatively large improvement in recognition rate compared with papers of recent years.
In this embodiment, the english of the Voice endpoint Detection algorithm is Voice Activity Detection, abbreviated as VAD, and the english of logarithmic energy is LogEnergy, abbreviated as LogE; the English of Zero-Crossing Rate is Zero-Crossing Rate, abbreviated as ZCR; the English language of the Teager Energy Operator is Teager Energy Operator, abbreviated as TEO; the Mel frequency cepstral coefficients are abbreviated as MFCC, and the speech endpoint detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and Mel frequency cepstrum coefficients are all known in the technical field.
In the present embodiment, the calculation operation method is understandable to those skilled in the art.
Claims (1)
1. The video emotion recognition method fusing facial expression recognition and voice emotion recognition is characterized by comprising the following steps of: the decision-level-based two-process progressive audiovisual emotion recognition method comprises the following specific steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T face feature points on the image frame sequence with normalized face image size from the second step, where T = 68; the positions of the 68 feature points are known, and the marked feature point outlines lie in the eye, eyebrow, nose and mouth regions of the face image; according to the coordinates of the T feature points, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized face image size in the second step:
the vertical distance between the eyes and the eyebrows is Du,1: Du,1 = dvertical||p22, p40||,
the vertical opening distance of the eye is Du,2: Du,2 = dvertical||p45, p47||,
the vertical distance between the eyes and the mouth is Du,3: Du,3 = dvertical||p37, p49||,
the vertical distance between the nose and the mouth is Du,4: Du,4 = dvertical||p34, p52||,
the vertical distance between the upper and lower lips is Du,5: Du,5 = dvertical||p52, p58||,
the horizontal width distance between the two mouth corners is Du,6: Du,6 = dhorizontal||p49, p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In the formula (1), piIs the coordinate set of the ith feature point, pjIs the coordinate set of the jth feature point, pi,yIs the ordinate, p, of the i-th feature pointj,yIs the ordinate of the jth feature point, pi,xIs the abscissa, p, of the ith feature pointj,xIs the abscissa of the jth feature point, dvertical||pi,pjI is the vertical distance between feature points i and j, dhorizontal||pi,pjI | is the horizontal distance between feature points i and j, i ═ 1,2, …,68, j ═ 1,2, …, 68;
setting the first frame of the image frame sequence with normalized face image size from the second step as the neutral frame, whose set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_0,1, D_0,2, D_0,3, D_0,4, D_0,5, D_0,6] (2),
in formula (2), D_0,1, D_0,2, D_0,3, D_0,4, D_0,5 and D_0,6 are the 6 specific distances corresponding to the neutral frame of the image frame sequence with normalized face image size from the second step;
the set V_u of the 6 specific distances of the u-th frame of the image frame sequence with normalized face image size from the second step is shown in formula (3),
V_u = [D_u,1, D_u,2, D_u,3, D_u,4, D_u,5, D_u,6] (3),
in formula (3), u = 1, 2, ..., K-1, where K is the number of face images in the image frame sequence with normalized face image size from the second step, and D_u,1, D_u,2, D_u,3, D_u,4, D_u,5, D_u,6 are the 6 specific distances corresponding to the u-th frame;
the sum of the ratios of the 6 corresponding specific distances of the u-th frame to those of the neutral frame is shown in formula (4),
DF_u = sum_{n=1}^{6} D_u,n / D_0,n (4),
in formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image, n indexes the 6 specific distances, D_0,n represents the n-th specific distance of the neutral frame, and D_u,n represents the n-th specific distance of the u-th frame;
the ratio sum DF_u of each frame of the image frame sequence with normalized face image size from the second step is obtained from formula (2), formula (3) and formula (4), and the u-th frame with the maximum DF_u is screened out as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
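The third-step key-frame screening can be sketched as below. The landmark indices are the patent's 1-based 68-point numbering converted to 0-based array indices, and the frame data here is synthetic rather than output of a real landmark detector:

```python
import numpy as np

# (i, j, axis) index pairs for the six specific distances D_u,1 ... D_u,6;
# patent landmarks p22, p40 etc. are 1-based, so 0-based indices are used.
PAIRS = [(21, 39, 1),   # eyes-eyebrows, vertical
         (44, 46, 1),   # eye opening, vertical
         (36, 48, 1),   # eyes-mouth, vertical
         (33, 51, 1),   # nose-mouth, vertical
         (51, 57, 1),   # upper-lower lip, vertical
         (48, 54, 0)]   # mouth width, horizontal

def distances(pts):
    """Six specific distances for one frame; pts is a (68, 2) (x, y) array."""
    return np.array([abs(pts[j, ax] - pts[i, ax]) for i, j, ax in PAIRS])

def key_frame(frames):
    """Index u of the frame maximising DF_u = sum_n D_u,n / D_0,n."""
    d0 = distances(frames[0])                        # frame 0: neutral frame
    dfs = [(distances(f) / d0).sum() for f in frames[1:]]
    return int(np.argmax(dfs)) + 1

# synthetic landmark sets: the last frame has every distance doubled, so it wins
pts0 = np.arange(136, dtype=float).reshape(68, 2)
frames = [pts0, pts0.copy(), pts0 * 2.0]
print(key_frame(frames))   # 2
```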
fourthly, extracting the texture features of the human face:
extracting the face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the central pixel of each 3 x 3 neighborhood is calculated in each orthogonal plane; the LBP histogram features of the three orthogonal planes are counted; finally, the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector; the LBP operator is calculated as shown in formula (5) and formula (6),
LBP_Z,R = sum_{q=0}^{Z-1} Sig(t_q - t_c) * 2^q (5),
Sig(x) = 1 if x >= 0, and Sig(x) = 0 if x < 0 (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_a,b = sum_{x,y,t} I{LBP_Z,R,b(x, y, t) = a}, a = 0, 1, ..., n_b - 1, b = 0, 1, 2 (7),
in formula (7), b is the index of the plane, with b = 0 corresponding to the XY plane, b = 1 to the XT plane and b = 2 to the YT plane; n_b is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBP_Z,R,b(x, y, t) = a} counts the pixels whose LBP coding value is a when feature extraction with the LBP_Z,R operator is performed in the b-th plane;
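A simplified illustration of the LBP-TOP idea follows. For brevity this sketch computes a basic LBP_{8,1} histogram on just one middle slice per orthogonal plane, whereas the full descriptor accumulates histograms over every slice; the three 2^8-bin histograms of the XY, XT and YT planes are then concatenated into one feature vector:

```python
import numpy as np

def lbp_histogram(plane, Z=8):
    """Basic LBP_{8,1} histogram of one plane slice (interior pixels only)."""
    h, w = plane.shape
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = np.zeros(2 ** Z, dtype=int)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            tc = plane[y, x]
            code = 0
            for q, (dy, dx) in enumerate(offs):
                if plane[y + dy, x + dx] >= tc:   # Sig(t_q - t_c) = 1
                    code |= 1 << q
            hist[code] += 1
    return hist

def lbp_top(volume):
    """Concatenate LBP histograms of the three orthogonal mid-planes
    of a (T, H, W) frame volume, in the spirit of LBP-TOP."""
    T, H, W = volume.shape
    xy = volume[T // 2]            # spatial plane
    xt = volume[:, H // 2, :]      # horizontal-temporal plane
    yt = volume[:, :, W // 2]      # vertical-temporal plane
    return np.concatenate([lbp_histogram(p) for p in (xy, xt, yt)])

vol = np.arange(6 * 8 * 8, dtype=float).reshape(6, 8, 8)
feats = lbp_top(vol)
print(feats.shape, feats.sum())   # (768,) 84
```

Each histogram sums to the number of interior pixels of its plane (36 + 24 + 24 here), which is a quick sanity check on the implementation.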
fifthly, extracting geometric features of the human face:
according to the key frame of the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame are used to compute the geometric features of the facial expression; in the field of facial expression recognition, the region richest in facial features is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, so the geometric feature extraction mainly extracts distance features between the marked points of this T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frame in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs in total; the Euclidean distance between each feature point pair A and B is calculated, giving 50-dimensional Euclidean distance features denoted G_50; the Euclidean distance between feature points A and B is calculated by formula (8),
d(A, B) = sqrt((p_B,x - p_A,x)^2 + (p_B,y - p_A,y)^2) (8),
in formula (8), p_A is the coordinate set of feature point A, p_B is the coordinate set of feature point B, p_A,x is the abscissa of feature point A, p_A,y is the ordinate of feature point A, p_B,x is the abscissa of feature point B, and p_B,y is the ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
selecting, from the T feature points of the key frame obtained by the third-step screening, 10 angles that represent facial feature changes, namely 2 eyebrow angles, 6 eye angles and 2 mouth angles; the angle features are extracted as 10-dimensional angle features denoted Q_10; the angle at a feature point is calculated by formula (9),
theta = arccos( ((p_C - p_D) . (p_E - p_D)) / (|p_C - p_D| * |p_E - p_D|) ) (9),
in formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points forming an angle in the eyebrow, eye or mouth region marked in the third step, where p_D is the vertex coordinate set;
and 5.3, calculating the area characteristics of the face region:
selecting 5 regions of the face image, namely the left and right eyebrows, the two eyes and the mouth, and calculating the area feature of each of the 5 regions; because the sizes of facial organs differ from person to person, the areas of the 5 face regions extracted from the key frame are correspondingly subtracted from the areas of the 5 face regions extracted from the neutral frame to obtain the change features of the face region areas, 5 dimensions in total, denoted O_5; the eyebrow, mouth and eye regions of the face are modelled as triangles, and the area of each triangle is calculated by Heron's formula; the Euclidean distance features G_50 of the face feature point pairs, the angle features Q_10 of the face feature points and the face region area features O_5 are combined into the geometric features F of the face as shown in formula (10),
F = [G_50 Q_10 O_5] (10),
at this point, the face texture features and the face geometric features are concatenated, completing the extraction of the facial expression features;
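The three geometric primitives of the fifth step (formula (8) distances, vertex angles, and Heron-formula triangle areas) can each be written in a few lines; the point coordinates below are toy values, not real landmarks:

```python
import math

def euclid(pA, pB):
    """Euclidean distance between two (x, y) points, as in formula (8)."""
    return math.hypot(pB[0] - pA[0], pB[1] - pA[1])

def vertex_angle(pC, pD, pE):
    """Angle in degrees at vertex D of the triangle C-D-E."""
    a = (pC[0] - pD[0], pC[1] - pD[1])
    b = (pE[0] - pD[0], pE[1] - pD[1])
    cosang = (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosang))))

def heron_area(p1, p2, p3):
    """Triangle area by Heron's formula, as used for the region features."""
    a, b, c = euclid(p1, p2), euclid(p2, p3), euclid(p3, p1)
    s = (a + b + c) / 2
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

print(euclid((0, 0), (3, 4)))                # 5.0
print(vertex_angle((1, 0), (0, 0), (0, 1)))  # 90.0
print(heron_area((0, 0), (4, 0), (0, 3)))    # 6.0
```

Applying these to the selected landmark pairs, triples and triangles yields the G_50, Q_10 and O_5 feature vectors respectively.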
sixthly, grouping the facial expressions:
the six basic emotions of the human face are: surprise, fear, anger, disgust, happiness and sadness; they are divided pairwise into three groups as follows:
first group: surprise and fear; second group: anger and disgust; third group: happiness and sadness;
seventhly, classifying the facial expression for the first time:
feeding the facial expression features extracted in the fourth and fifth steps into an ELM classifier for training and testing, thereby completing the first classification of facial expression recognition and obtaining its recognition result; the ELM parameters are set as: ELM type: "classification"; number of hidden layer neurons: "20"; activation function: the "Sigmoid" function;
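A minimal ELM of the kind configured in the seventh step (a random sigmoid hidden layer of 20 neurons, output weights solved by least squares) can be sketched as follows; the training data here is synthetic and the implementation is illustrative, not the patent's exact classifier:

```python
import numpy as np

def elm_train(X, y, n_hidden=20, n_classes=2, seed=0):
    """Random sigmoid hidden layer; output weights by pseudo-inverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # sigmoid activations
    T = np.eye(n_classes)[y]                   # one-hot targets
    beta = np.linalg.pinv(H) @ T               # least-squares output weights
    return W, b, beta

def elm_predict(X, model):
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)

# two trivially separable synthetic classes
X = np.vstack([np.zeros((20, 4)), np.ones((20, 4))])
y = np.array([0] * 20 + [1] * 20)
model = elm_train(X, y)
acc = (elm_predict(X, model) == y).mean()
print(acc)   # 1.0
```

Because the hidden layer is fixed and random, training reduces to a single pseudo-inverse, which is why ELMs train much faster than back-propagated networks.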
and B, taking the speech emotion recognition as a second classification recognition:
in process B, on the basis of the facial expression recognition result of process A and in combination with the voice features, speech emotion feature extraction and the second classification of speech emotion recognition are performed for each of the three groups of the sixth-step facial expression grouping; the specific operations are as follows:
and eighth, extracting the speech emotion characteristics:
for the classification results of the first classification of facial expression recognition in the seventh step, and according to the grouping of the sixth step, different prosodic features are extracted for each group according to the different sensitivities of each group's emotions to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch of the above prosodic features is calculated in the frequency domain;
for the speech signal preprocessed in the second step, the Pitch is calculated by formula (11),
Pitch = argmax_k |DFT(x~(m))(k)|, k = 0, 1, ..., L_M - 1 (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the speech signal, and x~(m) represents the speech signal after a Hamming window is applied; x~(m) is calculated as shown in formula (12),
x~(m) = x(m) * [0.54 - 0.46 cos(2 pi m / (N - 1))] (12),
in formula (12), N is the length of the Hamming window and m is the index of the m-th sample within the window;
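A crude frequency-domain pitch estimate in the spirit of formula (11) — Hamming-window the signal, take the DFT, and read off the strongest bin — can be sketched as follows (the function name `pitch_dft` and the simple peak-picking rule are illustrative assumptions, not the patent's exact procedure):

```python
import numpy as np

def pitch_dft(x, fs):
    """Frequency of the largest-magnitude bin of the Hamming-windowed DFT."""
    xw = x * np.hamming(len(x))          # formula (12): apply a Hamming window
    spec = np.abs(np.fft.rfft(xw))
    spec[0] = 0.0                        # ignore the DC bin
    return int(np.argmax(spec)) * fs / len(x)

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 220 * t)       # 220 Hz test tone
print(pitch_dft(tone, fs))               # 220.0
```

Real pitch trackers refine this with harmonic checks and interpolation between bins, but the strongest-bin rule already recovers the fundamental of a clean tone.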
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) sum_m |sgn{x(m)} - sgn{x(m - 1)}| (13),
in formula (13), ZCR represents the average zero-crossing rate of the N windows, | | is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the function sgn{x(m)} gives the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1 if x(m) >= 0, and sgn{x(m)} = -1 if x(m) < 0 (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE is calculated as shown in formula (15),
LogE = sum_{m=1}^{N} log( sum x(m)^2 ) (15),
in formula (15), LogE represents the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
psi[X(m)] = [X'(m)]^2 - X(m) X''(m) (16),
in formula (16), psi[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm and X''(m) = d^2 X(m)/dm^2; for a signal of constant amplitude and frequency, X(m) = a cos(Phi m + theta), where a is the signal amplitude, Phi is the signal frequency and theta is the initial phase angle, the operator yields psi[X(m)] = a^2 Phi^2,
for the audio files corresponding to the image frames of each of the three groups in the sixth-step facial expression grouping, the well-known Mel-frequency cepstral coefficients (MFCC) and their first-order and second-order difference features are extracted; finally, the prosodic features extracted for each group are concatenated with the corresponding MFCC and its first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
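The prosodic features of the eighth step can be sketched per analysis window as follows; the discrete form of the Teager operator, psi[x(m)] = x(m)^2 - x(m+1) x(m-1), is used in place of the continuous derivatives of formula (16), and the single long window here stands in for a framed signal:

```python
import numpy as np

def zcr(frame):
    """Average zero-crossing rate of one window, in the spirit of formula (13)."""
    s = np.sign(frame)
    s[s == 0] = 1                       # formula (14): sgn of zero counted as +1
    return 0.5 * np.abs(np.diff(s)).sum() / len(frame)

def log_energy(frame, eps=1e-12):
    """Log of the window's short-time energy."""
    return np.log10(eps + np.sum(frame ** 2))

def teager(frame):
    """Discrete Teager energy operator per sample."""
    return frame[1:-1] ** 2 - frame[2:] * frame[:-2]

fs = 8000
t = np.arange(fs) / fs
x = np.cos(2 * np.pi * 100 * t)        # 100 Hz test tone
print(round(zcr(x), 3))                # 0.025 (200 crossings / 8000 samples)
print(round(log_energy(x), 2))
print(round(teager(x).mean(), 6))
```

For a pure sinusoid the discrete Teager output is the constant a^2 sin^2(Omega), mirroring the a^2 Phi^2 result quoted for the continuous operator.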
and ninthly, classifying for the second time of the speech emotion recognition:
feeding the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition; the SVM parameters are set as: penalty coefficient: "95"; allowed redundant output: "0"; kernel parameter: "1"; kernel function of the support vector machine: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
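A ninth-step style SVM can be illustrated with scikit-learn's `SVC` as a stand-in; `C=95` and the RBF ("Gaussian") kernel with `gamma=1` mirror the stated penalty coefficient and kernel parameter, the "redundant output" setting has no direct scikit-learn analogue, and the feature data below is synthetic:

```python
# scikit-learn's SVC used as an illustrative stand-in for the patent's SVM
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 8)),    # "class 0" audio features
               rng.normal(4.0, 1.0, (30, 8))])   # "class 1" audio features
y = np.array([0] * 30 + [1] * 30)

# C=95 with an RBF kernel and gamma=1, mirroring the stated settings
clf = SVC(C=95, kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))
```

In practice the mixed audio features of the eighth step would replace the synthetic `X`, with held-out videos forming the test split.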
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
since the speech emotion recognition of process B is a secondary recognition performed on the basis of the facial expression recognition of process A, the relationship between the two recognition rates is a conditional probability relationship, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P(Visual) is the recognition rate of the first-stage facial image recognition, and P(Audio|Visual) is the recognition rate of the second-stage speech emotion recognition;
thereby completing the progressive video emotion recognition based on the decision-level fusion of the two processes of facial expression recognition and speech emotion recognition.
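Formula (17)'s decision-level fusion is a single product of the two stage recognition rates; as a trivial sketch with made-up rates:

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)."""
    return p_visual * p_audio_given_visual

# hypothetical stage accuracies: 90% facial, 85% speech given facial
print(round(fused_recognition_rate(0.90, 0.85), 3))   # 0.765
```

The product form reflects that the second stage only sees samples the first stage has already grouped, so its rate is conditional on the first stage's outcome.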
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811272233.1A CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409296A CN109409296A (en) | 2019-03-01 |
CN109409296B true CN109409296B (en) | 2020-12-01 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731416A (en) * | 2005-08-04 | 2006-02-08 | 上海交通大学 | Method of quick and accurate human face feature point positioning |
CN105139004A (en) * | 2015-09-23 | 2015-12-09 | 河北工业大学 | Face expression identification method based on video sequences |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN107704810A (en) * | 2017-09-14 | 2018-02-16 | 南京理工大学 | A kind of expression recognition method suitable for medical treatment and nursing |
CN108682431A (en) * | 2018-05-09 | 2018-10-19 | 武汉理工大学 | A kind of speech-emotion recognition method in PAD three-dimensionals emotional space |
Non-Patent Citations (3)
Title |
---|
"A new approach of audio emotion recognition";Chien Shing Ooi.et al;《ELSEVIER》;20140324;期刊第5858-5869页 * |
"基于LBP-TOP特征的微表情识别";卢官明等;《南京邮电大学学报(自然科学版)》;20171231;第37卷(第6期);第2章2.7-2.8节 * |
Emotion recognition using facial and audio features;Tarun Krishna.et al;《ICMI "13: Proceedings of the 15th ACM on International conference on multimodal interaction》;20131231;全文 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20201201 |