CN109409296B - Video emotion recognition method integrating facial expression recognition and voice emotion recognition - Google Patents

Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Info

Publication number
CN109409296B
CN109409296B (application CN201811272233.1A)
Authority
CN
China
Prior art keywords
recognition
formula
face
features
frame sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811272233.1A
Other languages
Chinese (zh)
Other versions
CN109409296A (en
Inventor
于明
张冰
郭迎春
于洋
师硕
郝小可
朱叶
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN201811272233.1A priority Critical patent/CN109409296B/en
Publication of CN109409296A publication Critical patent/CN109409296A/en
Application granted granted Critical
Publication of CN109409296B publication Critical patent/CN109409296B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video emotion recognition method that integrates facial expression recognition and speech emotion recognition. It concerns the processing of record carriers for recognizing graphic patterns and is a decision-level, two-stage progressive audiovisual emotion recognition method: facial expression recognition and speech emotion recognition in a video are separated, and speech emotion recognition is performed on the basis of the facial expression recognition result by computing a conditional probability. The method comprises: Process A, facial image expression recognition as the first classification recognition; Process B, speech emotion recognition as the second classification recognition; and Process C, fusion of facial expression recognition and speech emotion recognition. The invention overcomes the drawbacks of the prior art, which ignores the intrinsic relation between facial features and speech features in human emotion recognition and suffers from low recognition speed and low recognition rate in video emotion recognition.

Description

Video emotion recognition method integrating facial expression recognition and voice emotion recognition
Technical Field
The technical scheme of the invention relates to the processing of record carriers for recognizing graphic patterns, and in particular to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology is advancing quickly, and recognizing human emotion with computers has received extensive attention. How to enable a computer to recognize human emotion more quickly and accurately has therefore become a research hotspot in the field of machine vision.
Humans express emotion in many ways, mainly through facial expressions, speech emotion, upper-body posture, language text and the like. Among these, facial expression and speech emotion are the two most typical modes of emotional expression. Because facial texture and geometric features are easy to extract, emotion recognition methods based on facial expression can reach a relatively high recognition rate in the current emotion recognition field. However, for expressions that look alike, such as anger and aversion or fear and surprise, the texture features and geometric features are also similar, and the recognition rate obtained from facial expression features alone is not high.
Single-modality emotion recognition methods are therefore often limited, and bimodal or multimodal emotion recognition has increasingly become a focus of research in the emotion recognition field. The key to multimodal emotion recognition lies in the fusion strategy; the mainstream strategies are feature-level fusion and decision-level fusion.
In 2012, Schuller et al cascaded audio and video features into a single feature vector in "AVEC 2012: the continuous audio/visual emotion challenge" and used support vector regression (SVR) as the baseline of the AVEC 2012 challenge. This feature-level fusion approach directly concatenates multimodal features into a combined feature vector. Because the large number of multimodal features can cause a dimensionality disaster, high-dimensional features are easily plagued by data sparsity, and once the interaction between features is taken into account, the advantage of combining audio and video features by feature-level fusion is limited.
In decision-level fusion, each emotion expression modality is first modeled by its own classifier, and the recognition results of the individual classifiers are then fused; the different modalities are combined according to the contribution of each emotion expression without increasing the feature dimension. Seng et al, in the paper "A combined rule-based and machine learning audio-visual emotion recognition approach", split audiovisual emotion recognition into two mutually independent paths to extract features separately, model each path with its own classifier to obtain the corresponding recognition rates, and finally obtain the final recognition rate through a proportional scoring mechanism and a corresponding weight assignment. Existing decision-level fusion methods have two main disadvantages. First, the proportional scoring mechanism and the weight assignment strategy lack a unified, authoritative standard, so different researchers often obtain different recognition results on the same research project with different scoring mechanisms and weight assignments. Second, decision-level fusion focuses on fusing the face recognition result with the speech recognition result and ignores the intrinsic relation between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method of composite space-time characteristics, which expands the existing volume local binary pattern algorithm into a space-time ternary pattern, acquires space-time local ternary pattern moment texture characteristics of human face expression and upper body posture, further fuses three-dimensional gradient direction histogram characteristics to enhance description of emotion video, and combines the two characteristics into composite space-time characteristics.
CN105512609A discloses a multimodal fusion video emotion recognition method based on a kernel extreme learning machine, which performs feature extraction and feature selection on the image information and audio information of a video to obtain video features; the acquired multi-channel electroencephalogram (EEG) signals are preprocessed and subjected to feature extraction and feature selection to obtain EEG features; a multimodal fusion video emotion recognition model based on the kernel extreme learning machine is established; and the video features and EEG features are input into this model to perform video emotion recognition and obtain the final classification accuracy. However, the algorithm only achieves a high classification recognition rate for three classes of video emotion data, which limits its usability.
CN103400145B discloses a speech-vision fusion emotion recognition method based on a cue neural network. It first trains a separate neural network on the feature data of three channels (frontal facial expression, profile facial expression and speech) to recognize discrete emotion categories; during training, 4 cue nodes are added to the output layer of each neural network model to carry the cue information of 4 coarse-grained categories in the activation-evaluation space. A multimodal fusion model, itself a neural network trained on the cue information, then fuses the outputs of the three neural networks. However, in most videos the number of frames showing the profile of the face is small and hard to collect effectively, so the method is greatly limited in practical operation. The method also involves training and fusing neural networks; as the data volume and data dimension grow, the training time and resource consumption gradually increase, and the error rate also gradually increases.
CN105138991B discloses a video emotion recognition method based on the fusion of emotion-salient features, which extracts audio features and visual emotion features from each video shot in a training video set: the audio features form emotion distribution histogram features based on a bag-of-words model, the visual emotion features form emotion attention features based on a visual dictionary, and the emotion attention features and the emotion distribution histogram features are fused top-down to form video features with emotion saliency. However, the method extracts features only from the video key frames when extracting visual emotion features and, to some extent, ignores the relationship between features across video frames.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a video emotion recognition method fusing facial expression recognition and speech emotion recognition, namely a decision-level, two-stage progressive audiovisual emotion recognition method.
The technical scheme adopted by the invention to solve this technical problem is as follows: a video emotion recognition method fusing facial expression recognition and speech emotion recognition, which is a decision-level, two-stage progressive audiovisual emotion recognition method, specifically comprising the following steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
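As an illustration of this first step, a minimal Python sketch; it assumes, for illustration only, OpenCV for frame decomposition and the ffmpeg command-line tool for the audio dump (the text above uses the FormatFactory software instead), and the file names are hypothetical:

import subprocess
import cv2  # OpenCV, used here only to illustrate frame extraction

def extract_frames_and_audio(video_path="sample.avi", audio_path="sample.mp3"):
    """Decompose a video into a list of frames and dump its audio track as MP3."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # Extract the speech signal and store it in MP3 format (ffmpeg assumed on PATH).
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)
    return frames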
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
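A minimal sketch of this preprocessing, assuming OpenCV's bundled Haar cascade (a Viola & Jones detector) for face localization and a crude energy threshold in place of a full VAD; M = 128, the frame length and the threshold are illustrative assumptions, and the threshold presumes audio normalized to [-1, 1]:

import cv2
import numpy as np

def crop_and_normalize_faces(frames, M=128):
    """Detect, crop and resize the face in each frame (Viola & Jones via OpenCV)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue
        x, y, w, h = boxes[0]                     # keep the first detected face
        faces.append(cv2.resize(gray[y:y + h, x:x + w], (M, M)))
    return faces

def remove_silence(signal, frame_len=400, energy_thresh=1e-4):
    """Crude stand-in for VAD: drop frames whose mean energy is below a threshold."""
    kept = []
    for start in range(0, len(signal) - frame_len, frame_len):
        frame = signal[start:start + frame_len]
        if np.mean(frame.astype(float) ** 2) > energy_thresh:
            kept.append(frame)
    return np.concatenate(kept) if kept else signal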
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T facial feature points on the image frame sequence with normalized face image size obtained in the second step, where T is 68; the positions of the 68 feature points are known, and the marked feature points outline the eye, eyebrow, nose and mouth regions of the face image; for the u-th frame image in the image frame sequence with normalized face image size from the second step, the following 6 specific distances are calculated according to the coordinates of the T feature points:
the distance between the eyes and the eyebrows in the vertical direction is Du,1:Du,1=dvertical||p22,p40||,
The distance in the vertical direction of the opening of the eye is Du,2:Du,2=dvertical||p45,p47||,
The distance between the eyes and the mouth in the vertical direction is Du,3:Du,3=dvertical||p37,p49||,
The distance D in the vertical direction between the nose and the mouthu,4:Du,4=dvertical||p34,p52||,
The distance in the vertical direction of the upper and lower lips is Du,5:Du,5=dvertical||p52,p58||,
The two sides of the mouth have a width distance D in the horizontal directionu,6:Du,6=dhorizontal||p49,p55||,
And is provided with
dvertical||pi,pj||=|pj,y-pi,y|,dhorizontal||pi,pj||=|pj,x-pi,x| (1),
In the formula (1), piIs the coordinate set of the ith feature point, pjIs the coordinate set of the jth feature point, pi,yIs the ordinate, p, of the i-th feature pointj,yIs the ordinate of the jth feature point, pi,xIs the abscissa, p, of the ith feature pointj,xIs the abscissa of the jth feature point, dvertical||pi,pjI is the vertical distance between feature points i and j, dhorizontal||pi,pjI | is the horizontal distance between feature points i and j, i ═ 1,2, …,68, j ═ 1,2, …, 68;
setting the first frame in the image frame sequence with normalized face image size from the second step as the neutral frame, its set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
in formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized face image size from the second step;
the set V_u of 6 specific distances of the u-th frame in the image frame sequence with normalized face image size from the second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
in formula (3), u = 1, 2, ..., K-1, where K is the number of face images in the image frame sequence with normalized face image size from the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances corresponding to the u-th frame in that image frame sequence;
the sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame in the image frame sequence with normalized face image size from the second step is shown in formula (4),
DF_u = Σ_{n=1}^{6} D_{u,n} / D_{0,n}   (4),
in formula (4), DF_u denotes the sum of the ratios of the 6 specific distances corresponding to the u-th frame image and the neutral frame image in the image frame sequence with normalized face image size from the second step, n indexes the 6 specific distances, D_{0,n} denotes the n-th specific distance corresponding to the neutral frame, and D_{u,n} denotes the n-th specific distance corresponding to the u-th frame;
in the image frame sequence with normalized face image size from the second step, the ratio sum DF_u corresponding to each frame image is obtained from formula (2), formula (3) and formula (4), and the u-th frame image with the maximum DF_u is selected as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
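A minimal sketch of the key-frame screening in the third step, assuming the 68 landmarks per frame are already available as an array (for example from the Dlib detector mentioned for FIG. 3); the landmark indices are 1-based in the text and converted to 0-based here, and the ratio direction of formula (4) follows the reconstruction given above:

import numpy as np

# 1-based landmark index pairs for the 6 specific distances D_{u,1}..D_{u,6};
# the first five are vertical distances, the last one is horizontal.
DISTANCE_PAIRS = [(22, 40), (45, 47), (37, 49), (34, 52), (52, 58), (49, 55)]

def specific_distances(landmarks):
    """landmarks: (68, 2) array of (x, y) coordinates for one frame."""
    dists = []
    for k, (i, j) in enumerate(DISTANCE_PAIRS):
        pi, pj = landmarks[i - 1], landmarks[j - 1]       # convert to 0-based
        axis = 0 if k == 5 else 1                         # x for D_{u,6}, y otherwise
        dists.append(abs(pj[axis] - pi[axis]))            # formula (1)
    return np.array(dists, dtype=float)

def select_key_frame(sequence_landmarks):
    """sequence_landmarks: (K, 68, 2) array; frame 0 is taken as the neutral frame."""
    v0 = specific_distances(sequence_landmarks[0])        # formula (2)
    best_u, best_df = 1, -np.inf
    for u in range(1, len(sequence_landmarks)):
        vu = specific_distances(sequence_landmarks[u])    # formula (3)
        df_u = np.sum(vu / v0)                            # formula (4), as reconstructed
        if df_u > best_df:
            best_u, best_df = u, df_u
    return best_u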
fourthly, extracting the texture features of the human face:
extracting face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the center pixel of each 3 × 3 neighborhood is calculated in each orthogonal plane; the LBP histogram features of the three orthogonal planes are counted; and finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q   (5),
Sig(x) = 1 if x >= 0, and Sig(x) = 0 if x < 0   (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the center pixel, R is the distance between the neighborhood points and the center pixel, t_c is the pixel value of the center pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{ LBP_{Z,R,b}(x, y, t) = a },  a = 0, 1, ..., n_b - 1,  b = 0, 1, 2   (7),
in formula (7), b is the index of the plane, with b = 0 for the XY plane, b = 1 for the XT plane and b = 2 for the YT plane, n_b is the number of binary patterns produced by the LBP operator on the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP coding value is a when the LBP_{Z,R} operator is applied for feature extraction on the b-th plane;
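A simplified sketch of the LBP coding and histograms behind formulas (5) to (7), assuming Z = 8 neighbors at R = 1 and a 256-bin histogram; for brevity it samples one XY, one XT and one YT plane from the frame volume, whereas a full LBP-TOP implementation would aggregate over all positions:

import numpy as np

def lbp_image(gray, R=1):
    """Basic LBP code per pixel for a 2-D plane, Z = 8 neighbors at radius R = 1."""
    h, w = gray.shape
    # 8-neighborhood offsets (dy, dx) around the center pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h - 2 * R, w - 2 * R), dtype=np.uint8)
    center = gray[R:h - R, R:w - R].astype(int)
    for q, (dy, dx) in enumerate(offsets):
        neighbor = gray[R + dy:h - R + dy, R + dx:w - R + dx].astype(int)
        codes |= ((neighbor - center) >= 0).astype(np.uint8) << q   # formulas (5)/(6)
    return codes

def lbp_histogram(plane):
    """Normalized histogram of LBP codes on one plane, the per-plane term of formula (7)."""
    hist, _ = np.histogram(lbp_image(plane), bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)

def lbp_top(volume):
    """volume: (T, H, W) gray frame stack; mid-slices only, for brevity."""
    T, H, W = volume.shape
    xy = volume[T // 2]            # an XY plane
    xt = volume[:, H // 2, :]      # an XT plane
    yt = volume[:, :, W // 2]      # a YT plane
    return np.concatenate([lbp_histogram(p.astype(np.uint8)) for p in (xy, xt, yt)])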
fifthly, extracting geometric features of the human face:
calculating, from the key frame of the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame to obtain the geometric features of the facial expression; in the field of facial expression recognition, the region where facial features are richest is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, so the extraction of facial geometric features mainly extracts distance features between the marked points of the facial T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frame in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs of feature points in total, and calculating the Euclidean distance between each feature point pair A and B, giving 50-dimensional Euclidean distance features, denoted G_50; formula (8) for the Euclidean distance between feature points A and B is as follows,
d||p_A, p_B|| = sqrt( (p_{A,x} - p_{B,x})^2 + (p_{A,y} - p_{B,y})^2 )   (8),
in formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
selecting, from the T feature points of the key frame screened in the third step, 10 angles representing facial feature changes, namely 2 eyebrow angles, 6 eye angles and 2 mouth angles, and extracting the angle features, 10 dimensions in total, denoted Q_10; formula (9) for the angle at a feature point is as follows,
Q(p_C, p_D, p_E) = arccos( ((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||) )   (9),
in formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming an angle in the eyebrow, eye or mouth regions marked in the third step, with p_D the coordinate pair of the vertex;
and 5.3, calculating the area characteristics of the face region:
selecting 5 regions of the face image, namely the left and right eyebrows, the two eyes and the mouth, and calculating the area features of these 5 regions; because the sizes of facial organs differ from person to person, the areas of the 5 face regions extracted from the key frame are correspondingly subtracted from the areas of the 5 face regions extracted from the neutral frame to obtain the change features of the face region areas, 5 dimensions in total, denoted O_5; the eyebrow, mouth and eye regions of the face are treated as triangles, and the area of each triangle is calculated with Heron's formula; the Euclidean distance features G_50 of the facial feature point pairs, the angle features Q_10 of the facial feature points and the face region area features O_5 are combined into the geometric features F of the face as shown in formula (10),
F = [G_50  Q_10  O_5]   (10),
at this point, the facial texture features and the facial geometric features are connected in series to complete the extraction of the facial expression features;
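A minimal sketch of the three geometric feature families of the fifth step (pairwise distances, vertex angles, and triangle areas via Heron's formula); the landmark pairs, angle triples and triangles passed in are placeholders, not the patent's full 50-pair, 10-angle, 5-region selection:

import numpy as np

def euclidean(pa, pb):
    """Formula (8): Euclidean distance between two landmarks."""
    return float(np.hypot(pa[0] - pb[0], pa[1] - pb[1]))

def angle(pc, pd, pe):
    """Formula (9): angle at vertex pd formed with points pc and pe."""
    v1, v2 = np.asarray(pc) - np.asarray(pd), np.asarray(pe) - np.asarray(pd)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return float(np.arccos(np.clip(cos_a, -1.0, 1.0)))

def heron_area(pa, pb, pc):
    """Triangle area from its three vertices using Heron's formula."""
    a, b, c = euclidean(pb, pc), euclidean(pa, pc), euclidean(pa, pb)
    s = (a + b + c) / 2.0
    return float(np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0)))

def geometric_features(key_pts, neutral_pts, pairs, angles, triangles):
    """Concatenate G (distances), Q (angles) and O (neutral minus key-frame areas)."""
    G = [euclidean(key_pts[i], key_pts[j]) for i, j in pairs]
    Q = [angle(key_pts[c], key_pts[d], key_pts[e]) for c, d, e in angles]
    O = [heron_area(*(neutral_pts[k] for k in tri)) - heron_area(*(key_pts[k] for k in tri))
         for tri in triangles]                      # area change relative to the neutral frame
    return np.array(G + Q + O, dtype=float)         # formula (10): F = [G Q O]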
sixthly, grouping the facial expressions:
the six emotions of the human face are: surprise, fear, anger, aversion, happiness and sadness, which are divided into three groups in pairs, wherein the groups are as follows:
a first group: surprise and fear; second group: angry and aversion; third group: happy and sad;
seventhly, classifying the facial expression for the first time:
putting the facial expression features extracted in the fourth step and the fifth step into an ELM classifier for training and testing, thereby finishing the first classification of facial expression recognition and obtaining the recognition result of the first classification of the facial expression recognition, wherein the parameters of the ELM are set as: ELM type: "classification", number of hidden layer neurons: "20", activation function: a "Sigmoid" function;
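A minimal extreme learning machine classifier in the spirit of the seventh step, with random input weights, a sigmoid activation and output weights solved by a pseudo-inverse; the 20 hidden neurons follow the text, while the one-hot target encoding and everything else are implementation assumptions rather than the patent's own code:

import numpy as np

class SimpleELM:
    """Minimal ELM for classification (a sketch, not the patent's implementation)."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))    # sigmoid activation

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        T = (y[:, None] == self.classes_[None, :]).astype(float)   # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T                           # output weights
        return self

    def predict(self, X):
        scores = self._hidden(np.asarray(X, dtype=float)) @ self.beta
        return self.classes_[np.argmax(scores, axis=1)]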
and B, taking the speech emotion recognition as a second classification recognition:
in the process B, based on the facial expression recognition result in the process a, the speech features are combined, and the speech emotion feature extraction and the second classification of the speech emotion recognition are performed on each of the three groups in the sixth facial expression grouping, specifically, the operations are as follows:
and eighth, extracting the speech emotion characteristics:
and aiming at the classification result of the first classification of the facial expression recognition in the seventh step, according to the grouping in the sixth step, different prosodic features are respectively extracted according to the different sensitivity degrees of the emotions of each group to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the prosodic feature Pitch described above is calculated in the frequency domain,
for the speech signal M preprocessed in the second step above, the Pitch is calculated by the following formula (11),
Pitch = DFT( x_Hamming(m), L_M )   (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M denotes the length of the speech signal, and x_Hamming(m) denotes the speech signal with a Hamming window applied, which is calculated as shown in the following formula (12),
x_Hamming(m) = x(m) × [0.54 - 0.46·cos(2πm / (N - 1))]   (12),
in formula (12), N is the number of Hamming windows and m denotes the m-th Hamming window;
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) Σ_{m=1}^{N} | sgn{x(m)} - sgn{x(m-1)} |   (13),
in formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the function sgn{x(m)} judges the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1 when x(m) >= 0, and sgn{x(m)} = -1 when x(m) < 0   (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE described above is calculated as in formula (15),
LogE = log( Σ_{m=1}^{N} x(m)^2 )   (15),
in formula (15), LogE denotes the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X'(m)]^2 - X(m)·X''(m)   (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm and X''(m) = d^2X(m)/dm^2; for a signal of constant amplitude and frequency, X(m) = a·cos(Φm + θ), where a is the signal amplitude, Φ is the signal frequency and θ is the initial phase angle of the signal, formula (16) yields ψ[X(m)] = a^2·Φ^2,
for the audio files corresponding to the image frames of each of the three groups in the sixth-step facial expression grouping, the well-known Mel frequency cepstrum coefficients (MFCC) and their first-order and second-order difference features are extracted, and finally the prosodic features extracted for each group are concatenated with the corresponding Mel frequency cepstrum coefficients (MFCC) and their first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
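A minimal sketch of frame-level prosodic features for the eighth step (zero-crossing rate, log energy, and a Teager energy operator in its common discrete form x(m)^2 - x(m-1)·x(m+1)); the frame length, hop size and the exact aggregation are assumptions, and the MFCC part would come from a separate library routine:

import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D speech signal into Hamming-windowed frames."""
    window = np.hamming(frame_len)                    # 0.54 - 0.46*cos(2*pi*m/(N-1))
    frames = [x[s:s + frame_len] * window
              for s in range(0, len(x) - frame_len, hop)]
    return np.array(frames)

def zero_crossing_rate(frames):
    """Average zero-crossing rate over all frames (cf. formulas (13)-(14))."""
    signs = np.sign(frames)
    signs[signs == 0] = 1
    return float(np.mean(np.abs(np.diff(signs, axis=1))) / 2.0)

def log_energy(frames):
    """Log of the total frame energy (one plausible reading of formula (15))."""
    return float(np.log(np.sum(frames ** 2) + 1e-12))

def teager_energy(frames):
    """Mean discrete Teager energy x(m)^2 - x(m-1)*x(m+1) over all frames."""
    teo = frames[:, 1:-1] ** 2 - frames[:, :-2] * frames[:, 2:]
    return float(np.mean(teo))

def prosodic_features(x):
    frames = frame_signal(np.asarray(x, dtype=float))
    return np.array([zero_crossing_rate(frames), log_energy(frames), teager_energy(frames)])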
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, wherein the parameters of the SVM are set as: penalty coefficient: "95", allowed redundant output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
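A minimal sketch of the ninth-step classifier using scikit-learn's SVC with a Gaussian (RBF) kernel; mapping the text's penalty coefficient 95 and kernel parameter 1 onto the C and gamma arguments is an assumption about how those settings translate:

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_audio_svm(train_features, train_labels):
    """Train the per-group binary speech-emotion classifier (Gaussian/RBF kernel)."""
    model = make_pipeline(
        StandardScaler(),
        SVC(C=95.0, gamma=1.0, kernel="rbf"),   # penalty coefficient / kernel parameter
    )
    model.fit(train_features, train_labels)
    return model

def audio_recognition_rate(model, test_features, test_labels):
    """Recognition rate P(Audio|Visual) of the second classification on held-out data."""
    return float(model.score(test_features, test_labels))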
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
since the speech emotion recognition in Process B is a secondary recognition performed on the basis of the facial expression recognition in Process A, the relationship between the two recognition rates is a conditional probability relationship, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
in formula (17), P(Visual) is the recognition rate of the first classification on the face images, and P(Audio|Visual) is the recognition rate of the second classification on speech emotion;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
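A minimal sketch of the decision-level fusion of the tenth step, which simply applies formula (17); the numbers in the usage line are made up for illustration:

def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)."""
    return p_visual * p_audio_given_visual

# Illustrative usage with made-up rates: a 0.90 first-stage (grouping) rate and a
# 0.88 second-stage (within-group) rate give a fused rate of about 0.792.
if __name__ == "__main__":
    print(fused_recognition_rate(0.90, 0.88))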
In the video emotion recognition method combining facial expression recognition and speech emotion recognition, in the third step, the coordinates of the T feature points are determined, where T is 68.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the voice endpoint detection algorithm is Voice Activity Detection in English, abbreviated VAD; the zero-crossing rate is Zero-Crossing Rate in English, abbreviated ZCR; the logarithmic energy is LogEnergy in English, abbreviated LogE; the Mel frequency cepstrum coefficients are Mel-frequency cepstral coefficients in English, abbreviated MFCC; and the Teager Energy Operator is abbreviated TEO. The voice endpoint detection algorithm, zero-crossing rate, logarithmic energy, Mel frequency cepstrum coefficients and Teager energy operator are all well known in the technical field.
The above video emotion recognition method combining facial expression recognition and speech emotion recognition is a calculation operation method that can be grasped by those skilled in the art.
The invention has the beneficial effects that: compared with the prior art, the invention has the prominent substantive characteristics and remarkable progress as follows:
(1) The invention provides a video emotion recognition method fusing facial expression recognition and speech emotion recognition, a decision-level audiovisual emotion recognition method. The method separates facial expression recognition and speech emotion recognition in the video and adopts a two-stage progressive emotion recognition approach: by computing a conditional probability, the speech emotion recognition performed on the basis of the facial expression recognition fully considers the influence of the facial expression recognition result on the speech emotion recognition. Facial expression recognition and speech emotion recognition are thus fused more closely and assist each other, yielding a better human emotion recognition effect and overcoming the defects of the prior art, which ignores the intrinsic relation between facial features and speech features in human emotion recognition and suffers from low recognition speed and low recognition rate in video emotion recognition.
(2) In the 2014 paper "A new approach of audio emotion recognition", Chien et al present an "audio feature analysis" experiment demonstrating that the 6 emotions have different degrees of sensitivity to the prosodic features Pitch, Zero-Crossing Rate, Log Energy and Teager Energy Operator. The paper classifies the Mel-Frequency Cepstral Coefficients (MFCC) extracted from audio with an SVM classifier, and the recognition rate decreases progressively from two-class to four-class to six-class classification: the fewer the classes, the better the classifier performs. Therefore, in the present invention, three classes are used in the first (facial expression) classification and two classes in the second (audio) classification. The method simplifies the multi-class problem into a three-class problem followed by a two-class problem, thereby reducing the feature dimension, shortening the training time and greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the invention has the advantages that the method not only extracts the face characteristics in the video, but also extracts the audio characteristics in the video, and the bimodal combination of the face characteristics and the audio characteristics is beneficial to more accurately identifying the emotion of the person in the video.
(4) Compared with CN105512609A, whose method can only identify three emotions in a video, the present method identifies six emotions in the video, and its average recognition rate is 9.92% higher than that of CN105512609A.
(5) Compared with CN105138991A, the method of the invention classifies the face features and the speech features separately, avoiding the "dimensionality disaster" easily caused by feature-level fusion; the decision-level fusion method is simple to operate, and training and recognition are faster.
(6) When the audio features are extracted, the different sensibilities of different emotions to different audio features are considered, so that different audio features are extracted from each group, and the second classification based on the voice features is facilitated.
(7) The method extracts texture, geometric, temporal and prosodic features; different features reflect different characteristics of the expressions, so the classifiers can be well trained from multiple modalities to perform video emotion recognition.
(8) The invention applies a two-stage progressive emotion classification method, with facial recognition as the primary cue and speech recognition as the auxiliary cue; the two are complementary and assist each other, achieving more accurate video emotion recognition.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a block diagram showing the flow of the method of the present invention.
Fig. 2 is a schematic diagram of the labeling of 6 specific distances and 68 feature points of a human face.
Fig. 3 is a diagram of an example of 68 feature point labels for a face in the eNTERFACE' 05 database.
Detailed Description
The embodiment shown in FIG. 1 shows that the flow of the method of the present invention comprises Process A, Process B and Process C; the detailed step blocks of each process are given in the figure image and are not reproduced here.
The embodiment shown in FIG. 2 is an example image labeled with the 68 facial feature points and the 6 specific distances: the vertical distance between feature points 22 and 40 is denoted D_{u,1}, the vertical distance between feature points 45 and 47 is denoted D_{u,2}, the vertical distance between feature points 37 and 49 is denoted D_{u,3}, the vertical distance between feature points 34 and 52 is denoted D_{u,4}, the vertical distance between feature points 52 and 58 is denoted D_{u,5}, and the horizontal distance between feature points 49 and 55 is denoted D_{u,6}. The connecting lines between the feature points in the figure outline the eyebrow, eye and mouth regions of the face.
The embodiment shown in fig. 3 shows an example of labeling a face in the eNTERFACE' 05 database with Dlib feature points, where 68 feature points labeled in the figure correspond to the labels of the 68 feature points shown in the schematic diagram of labeling the face feature points in fig. 2.
Example 1
The video emotion recognition method fusing facial expression recognition and voice emotion recognition in the embodiment is a decision-level-based two-process progressive audiovisual emotion recognition method, and specifically comprises the following steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cutting the human face of the image frame sequence obtained in the first step by using a disclosed Viola & Jones algorithm, and normalizing the size of the cut human face image into M multiplied by M pixels to obtain an image frame sequence with the normalized size of the human face image;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
marking T feature points, where T = 68, on the image frame sequence with normalized face image size obtained in the second step; the positions of the 68 feature points are known, and the marked feature points outline the eye, eyebrow, nose and mouth regions of the face image. In this embodiment, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized face image size from the second step according to the coordinates of the T = 68 feature points:
the vertical distance between eyebrow and eye, D_{u,1} = d_vertical||p_22, p_40||,
the vertical eye-opening distance, D_{u,2} = d_vertical||p_45, p_47||,
the vertical distance between eye and mouth, D_{u,3} = d_vertical||p_37, p_49||,
the vertical distance between nose and mouth, D_{u,4} = d_vertical||p_34, p_52||,
the vertical distance between upper and lower lip, D_{u,5} = d_vertical||p_52, p_58||,
the horizontal width between the two corners of the mouth, D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} - p_{i,y}|,  d_horizontal||p_i, p_j|| = |p_{j,x} - p_{i,x}|   (1),
in formula (1), p_i is the coordinate pair of the i-th feature point, p_j is the coordinate pair of the j-th feature point, p_{i,y} and p_{j,y} are the ordinates of the i-th and j-th feature points, p_{i,x} and p_{j,x} are the abscissas of the i-th and j-th feature points, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, i = 1, 2, ..., 68, j = 1, 2, ..., 68;
setting the first frame in the image frame sequence with normalized face image size from the second step as the neutral frame, its set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
in formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized face image size from the second step;
the set V_u of 6 specific distances of the u-th frame in the image frame sequence with normalized face image size from the second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
in formula (3), u = 1, 2, ..., K-1, where K is the number of face images in the image frame sequence with normalized face image size from the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances corresponding to the u-th frame in that image frame sequence;
the sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame in the image frame sequence with normalized face image size from the second step is shown in formula (4),
DF_u = Σ_{n=1}^{6} D_{u,n} / D_{0,n}   (4),
in formula (4), DF_u denotes the sum of the ratios of the 6 specific distances corresponding to the u-th frame image and the neutral frame image in the image frame sequence with normalized face image size from the second step, n indexes the 6 specific distances, D_{0,n} denotes the n-th specific distance corresponding to the neutral frame, and D_{u,n} denotes the n-th specific distance corresponding to the u-th frame;
in the image frame sequence with normalized face image size from the second step, the ratio sum DF_u corresponding to each frame image is obtained from formula (2), formula (3) and formula (4), and the u-th frame image with the maximum DF_u is selected as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
fourthly, extracting the texture features of the human face:
extracting face texture features with the LBP-TOP algorithm: first, the image frame sequence with normalized face image size from the second step is divided in space-time into the XY, XT and YT orthogonal planes; the LBP value of the center pixel of each 3 × 3 neighborhood is calculated in each orthogonal plane; the LBP histogram features of the three orthogonal planes are counted; and finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q   (5),
Sig(x) = 1 if x >= 0, and Sig(x) = 0 if x < 0   (6),
in formula (5) and formula (6), Z is the number of neighborhood points of the center pixel, R is the distance between the neighborhood points and the center pixel, t_c is the pixel value of the center pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP coding value of the q-th neighborhood point,
the LBP-TOP histogram is defined as shown in formula (7),
H_{a,b} = Σ_{x,y,t} I{ LBP_{Z,R,b}(x, y, t) = a },  a = 0, 1, ..., n_b - 1,  b = 0, 1, 2   (7),
in formula (7), b is the index of the plane, with b = 0 for the XY plane, b = 1 for the XT plane and b = 2 for the YT plane, n_b is the number of binary patterns produced by the LBP operator on the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP coding value is a when the LBP_{Z,R} operator is applied for feature extraction on the b-th plane;
fifthly, extracting geometric features of the human face:
calculating, from the key frame of the screened image frame sequence obtained in the third step, the coordinates of the T feature points marked in the key frame to obtain the geometric features of the facial expression; in the field of facial expression recognition, the region where facial features are richest is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, with the specific feature points listed in the tables below, so the extraction of facial geometric features mainly extracts distance features between the marked points of the facial T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
selecting, from the T feature points of the key frame in the screened image frame sequence obtained in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs of feature points in total, and calculating the Euclidean distance between each feature point pair A and B, giving 50-dimensional Euclidean distance features, denoted G_50; formula (8) for the Euclidean distance between feature points A and B is as follows,
d||p_A, p_B|| = sqrt( (p_{A,x} - p_{B,x})^2 + (p_{A,y} - p_{B,y})^2 )   (8),
in formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
table 1 shows the pairs of face feature points to be calculated in the face T-shaped region, where d | | pA,pB| | represents the euclidean distance between pairs of feature points A, B;
Table 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
And 5.2, calculating the angle characteristics of the face characteristic points:
selecting, from the 68 feature points of the key frame screened in the third step, 10 angles representing facial feature changes, namely 2 eyebrow angles, 6 eye angles and 2 mouth angles, and extracting the angle features, 10 dimensions in total, denoted Q_10; the specific angles are shown in Table 2, and formula (9) for the angle at a feature point is as follows,
Q(p_C, p_D, p_E) = arccos( ((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||) )   (9),
in formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming an angle in the eyebrow, eye or mouth regions marked in the third step, with p_D the coordinate pair of the vertex;
Table 2 shows the angles of the facial feature points to be calculated in the facial T-shaped region, where Q(p_C, p_D, p_E) denotes the angle feature with vertex at point D;
Table 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
And 5.3, calculating the area characteristics of the face region:
selecting 5 regions of the face image, namely the left and right eyebrows, the two eyes and the mouth, and calculating the area features of these 5 regions, the specific regions being shown in Table 3;
Table 3
[Table 3 is provided as an image in the original publication and is not reproduced here.]
Table 3 shows the areas of the regions enclosed by the facial feature points to be calculated in the facial T-shaped region, where O(p_A, p_B, p_C, p_D) denotes the area of the region enclosed by the lines connecting feature points A, B, C and D;
because the sizes of facial organs differ from person to person, the areas of the 5 face regions of Table 3 extracted from the key frame are correspondingly subtracted from the areas of the same regions extracted from the neutral frame to obtain the change features of the face region areas, 5 dimensions in total, denoted O_5; the eyebrow, mouth and eye regions of the face are treated as triangles, and the area of each triangle is calculated with Heron's formula; the Euclidean distance features G_50 of the facial feature point pairs, the angle features Q_10 of the facial feature points and the face region area features O_5 are combined into the geometric features F of the face as shown in formula (10),
F = [G_50  Q_10  O_5]   (10),
at this point, the facial texture features and the facial geometric features are connected in series to complete the extraction of the facial expression features;
sixthly, grouping the facial expressions:
the six emotions of the human face are: surprise, fear, anger, aversion, happiness and sadness, which are divided into three groups in pairs, wherein the groups are as follows:
a first group: surprise and fear; second group: angry and aversion; third group: happy and sad;
seventhly, classifying the facial expression for the first time:
putting the facial expression features extracted in the fourth step and the fifth step into an ELM classifier for training and testing, thereby finishing the first classification of facial expression recognition and obtaining the recognition result of the first classification of the facial expression recognition, wherein the parameters of the ELM are set as: ELM type: "classification", number of hidden layer neurons: "20", activation function: a "Sigmoid" function;
and B, taking the speech emotion recognition as a second classification recognition:
in the process B, based on the facial expression recognition result in the process a, the speech features are combined, and the speech emotion feature extraction and the second classification of the speech emotion recognition are performed on each of the three groups in the sixth facial expression grouping, specifically, the operations are as follows:
and eighth, extracting the speech emotion characteristics:
and aiming at the classification result of the first classification of the facial expression recognition in the seventh step, according to the grouping in the sixth step, different prosodic features are respectively extracted according to the different sensitivity degrees of the emotions of each group to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the prosodic feature Pitch described above is calculated in the frequency domain,
for the speech signal M preprocessed in the second step above, the Pitch is calculated by the following formula (11),
Pitch = DFT( x_Hamming(m), L_M )   (11),
in formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M denotes the length of the speech signal, and x_Hamming(m) denotes the speech signal with a Hamming window applied, which is calculated as shown in the following formula (12),
x_Hamming(m) = x(m) × [0.54 - 0.46·cos(2πm / (N - 1))]   (12),
in formula (12), N is the number of Hamming windows and m denotes the m-th Hamming window;
the zero-crossing rate ZCR of the prosodic features is calculated as shown in formula (13),
ZCR = (1 / (2N)) Σ_{m=1}^{N} | sgn{x(m)} - sgn{x(m-1)} |   (13),
in formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the function sgn{x(m)} judges the sign of the speech amplitude; sgn{x(m)} is calculated by formula (14),
sgn{x(m)} = 1 when x(m) >= 0, and sgn{x(m)} = -1 when x(m) < 0   (14),
in formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
the logarithmic energy LogE described above is calculated as in formula (15),
LogE = log( Σ_{m=1}^{N} x(m)^2 )   (15),
in formula (15), LogE denotes the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
the Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X'(m)]^2 - X(m)·X''(m)   (16),
in formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm and X''(m) = d^2X(m)/dm^2; for a signal of constant amplitude and frequency, X(m) = a·cos(Φm + θ), where a is the signal amplitude, Φ is the signal frequency and θ is the initial phase angle of the signal, formula (16) yields ψ[X(m)] = a^2·Φ^2,
for the audio files corresponding to the image frames of each of the three groups in the sixth-step facial expression grouping, the well-known Mel frequency cepstrum coefficients (MFCC) and their first-order and second-order difference features are extracted, and finally the prosodic features extracted for each group are concatenated with the corresponding Mel frequency cepstrum coefficients (MFCC) and their first-order and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
and ninthly, classifying for the second time of the speech emotion recognition:
putting the speech emotion features extracted in the eighth step into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, wherein the parameters of the SVM are set as: penalty coefficient: "95", allowed redundant output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
because the speech emotion recognition is the secondary recognition based on the face emotion recognition, the relationship of the two recognition rates belongs to the relationship of the conditional probability, the final recognition rate P (Audio _ Visual) calculation method is shown as the formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first-stage facial image recognition, and P(Audio|Visual) is the recognition rate of the second-stage speech emotion recognition;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
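At the decision level, equation (17) reduces to a single multiplication. The sketch below only illustrates that arithmetic; the numbers in the usage comment are invented for illustration and are not taken from the experiments.

```python
def fused_recognition_rate(p_visual: float, p_audio_given_visual: float) -> float:
    """Equation (17): P(Audio_Visual) = P(Visual) * P(Audio | Visual)."""
    return p_visual * p_audio_given_visual

# Illustrative values only: a first-stage rate of 0.90 and a conditional
# second-stage rate of 0.95 fuse to 0.90 * 0.95 = 0.855.
print(fused_recognition_rate(0.90, 0.95))
```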
In this embodiment, comparison experiments with prior-art methods were carried out on the eNTERFACE'05 and RML databases; the specific recognition rates are shown in Table 4 below:
TABLE 4
[Table 4 (image): recognition-rate comparison of the proposed method and prior audiovisual emotion recognition methods on the eNTERFACE'05 and RML databases]
The experimental results in Table 4 compare the recognition rates of audiovisual emotion recognition systems of recent years on the eNTERFACE'05 and RML databases: the average recognition rate of audiovisual emotion recognition on the eNTERFACE'05 database in the 2014 paper "Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks" by Mahdi Bejani et al. is 77.78%;
the average recognition rate of audiovisual emotion recognition on the RML database in the 2016 paper "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition" by Shiqing Zhang et al. is 74.32%;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the 2017 paper "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition" by Shiqing Zhang et al. are 85.97% and 80.36%, respectively;
the average recognition rates of audiovisual emotion recognition on the eNTERFACE'05 and RML databases in the 2018 paper "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach" by Yaxiong Ma et al. are 84.56% and 81.98%, respectively. The decision-level, two-process progressive audiovisual emotion recognition method adopted in this embodiment therefore achieves a considerable improvement in recognition rate over these recent works.
In this embodiment, the English name of the voice endpoint detection algorithm is Voice Activity Detection, abbreviated VAD; logarithmic energy is Log Energy, abbreviated LogE; zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the Teager energy operator is abbreviated TEO; and the Mel-frequency cepstral coefficients are abbreviated MFCC. The voice endpoint detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and Mel-frequency cepstral coefficients are all well known in the technical field.
In the present embodiment, the calculation operation method is understandable to those skilled in the art.

Claims (1)

1. A video emotion recognition method fusing facial expression recognition and speech emotion recognition, characterized in that the decision-level-based two-process progressive audiovisual emotion recognition method comprises the following specific steps:
process a. facial image expression recognition is used as a first classification recognition:
the process A comprises the steps of extracting facial expression characteristics, grouping facial expressions and first-time classification of facial expression recognition, and comprises the following steps:
firstly, video frame extraction and voice signal extraction are carried out on a video signal:
decomposing the video in the database into an image frame sequence, performing video frame extraction by utilizing open-source FormatFactory software, and extracting and storing the voice signal in the video into an MP3 format;
secondly, preprocessing of the image frame sequence and the voice signal:
positioning and cropping the human face in the image frame sequence obtained in the first step by using the publicly known Viola & Jones algorithm, and normalizing the size of the cropped face image to M × M pixels to obtain an image frame sequence with normalized face image size;
carrying out voice detection on the voice signals obtained in the first step by utilizing a known voice endpoint detection algorithm VAD and removing noise and silence segments to obtain voice signals with characteristics easier to extract;
thereby completing the pre-processing of the image frame sequence and the voice signal;
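A rough sketch of the second-step preprocessing is given below: OpenCV's Haar-cascade detector stands in for the Viola & Jones face localisation, and a plain energy-threshold gate stands in for the VAD algorithm. The normalised size M = 96, the 400/160-sample frame and hop, and the -35 dB threshold are assumptions of this sketch, not values fixed by the claim.

```python
import cv2
import numpy as np

def crop_face(frame_bgr, size=96):
    """Viola-Jones style face localisation (OpenCV Haar cascade), cropped and
    resized to M x M pixels; M = 96 is an assumed value."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (size, size))

def simple_vad(x, frame=400, hop=160, thresh_db=-35.0):
    """Energy-threshold stand-in for the VAD step: keep only frames above the threshold."""
    frames = np.lib.stride_tricks.sliding_window_view(x, frame)[::hop]
    db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    voiced = frames[db > thresh_db]
    return voiced.reshape(-1) if voiced.size else x
```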
thirdly, marking the human face characteristic points in the image frame sequence and screening key frames in the image frame sequence:
carrying out face T feature point marking on the image frame sequence with the normalized face image size in the second step, wherein T is 68, the positions of the 68 feature points are known, the marked feature point outlines are respectively in the eyes, eyebrows, noses and mouth regions of the face image, and calculating the following 6 specific distances for the u frame image in the image frame sequence with the normalized face image size in the second step according to the coordinates of the T feature points:
The distance between the eyes and the eyebrows in the vertical direction is D_{u,1}: D_{u,1} = d_vertical||p_22, p_40||,
The opening distance of the eye in the vertical direction is D_{u,2}: D_{u,2} = d_vertical||p_45, p_47||,
The distance between the eyes and the mouth in the vertical direction is D_{u,3}: D_{u,3} = d_vertical||p_37, p_49||,
The distance between the nose and the mouth in the vertical direction is D_{u,4}: D_{u,4} = d_vertical||p_34, p_52||,
The distance between the upper and lower lips in the vertical direction is D_{u,5}: D_{u,5} = d_vertical||p_52, p_58||,
The width of the mouth between its two corners in the horizontal direction is D_{u,6}: D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} − p_{i,y}|,  d_horizontal||p_i, p_j|| = |p_{j,x} − p_{i,x}|    (1),
In formula (1), p_i is the coordinate set of the i-th feature point, p_j is the coordinate set of the j-th feature point, p_{i,y} is the ordinate of the i-th feature point, p_{j,y} is the ordinate of the j-th feature point, p_{i,x} is the abscissa of the i-th feature point, p_{j,x} is the abscissa of the j-th feature point, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, i = 1,2,…,68, j = 1,2,…,68;
Set the first frame in the image frame sequence with normalized face image size in the second step as the neutral frame; its set V_0 of 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]    (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized face image size in the second step;
The set V_u of 6 specific distances of the u-th frame in the image frame sequence with normalized face image size in the second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]    (3),
In formula (3), u = 1,2,…,K−1, where K is the number of face images in the image frame sequence with normalized face image size in the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances corresponding to the u-th frame of that sequence;
The sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized face image size in the second step is shown in formula (4),
DF_u = Σ_{n=1}^{6} ( D_{u,n} / D_{0,n} )    (4),
In formula (4), DF_u is the sum of the ratios of the 6 specific distances between the u-th frame image and the neutral frame image in the image frame sequence with normalized face image size in the second step, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance corresponding to the neutral frame, and D_{u,n} is the n-th specific distance corresponding to the u-th frame;
For the image frame sequence with normalized face image size in the second step, the ratio sum DF_u of the specific distances corresponding to each frame image is obtained by formula (2), formula (3) and formula (4), and the u-th frame image corresponding to the maximum DF_u is screened out as the key frame of the image frame sequence,
thus, marking the facial feature points of the image frame sequence and screening key frames in the image frame sequence;
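Assuming the 68 landmarks of each frame are already available as a (K, 68, 2) array, the third step can be sketched as follows. The landmark indices follow the 1-based numbering used in the claim (p_22, p_40, ...), and the ratio direction D_{u,n}/D_{0,n} follows the reconstruction of equation (4) above; both are assumptions of this sketch.

```python
import numpy as np

# Landmark pairs for the six distances D_{u,1}..D_{u,6}; indices are 1-based as in
# the claim, so 1 is subtracted when indexing the array.
VERTICAL_PAIRS = [(22, 40), (45, 47), (37, 49), (34, 52), (52, 58)]
HORIZONTAL_PAIR = (49, 55)

def six_distances(pts):
    """pts: (68, 2) array of (x, y) landmarks for one frame -> D_1..D_6, eqs. (1)-(3)."""
    d = [abs(pts[j - 1, 1] - pts[i - 1, 1]) for i, j in VERTICAL_PAIRS]   # |y_j - y_i|
    i, j = HORIZONTAL_PAIR
    d.append(abs(pts[j - 1, 0] - pts[i - 1, 0]))                          # |x_j - x_i|
    return np.asarray(d, dtype=float)

def select_key_frame(landmarks):
    """landmarks: (K, 68, 2); frame 0 is the neutral frame. Equation (4): the key
    frame maximises the sum of distance ratios relative to the neutral frame."""
    v0 = six_distances(landmarks[0]) + 1e-8          # guard against division by zero
    df = [float(np.sum(six_distances(landmarks[u]) / v0))
          for u in range(1, len(landmarks))]
    return int(np.argmax(df)) + 1                    # index of the selected key frame
```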
fourthly, extracting the texture features of the human face:
extracting human face texture features by using an LBP-TOP algorithm, firstly, dividing an image frame sequence with the normalized human face image size in the second step into XY, XT and YT orthogonal planes in space-time, calculating LBP values of central pixel points in 3 x 3 neighborhoods in each orthogonal plane, counting LBP histogram features of the three orthogonal planes, and finally connecting the LBP histograms of the three orthogonal planes to form an integral feature vector, wherein the LBP operator calculation method is shown as a formula (5) and a formula (6),
LBP_{Z,R}(t_c) = Σ_{q=0}^{Z−1} Sig(t_q − t_c) · 2^q    (5),
Sig(t_q − t_c) = 1, if t_q − t_c ≥ 0;  Sig(t_q − t_c) = 0, if t_q − t_c < 0    (6),
In formula (5) and formula (6), Z is the number of neighborhood points of the central pixel point, R is the distance between the neighborhood points and the central pixel point, t_c is the pixel value of the central pixel point, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q − t_c) is the LBP coding value of the q-th neighborhood point,
The LBP-TOP histogram is defined as shown in equation (7),
H_{a,b} = Σ_{x,y,t} I{ LBP_{Z,R,b}(x, y, t) = a },  a = 0,1,…,n_b − 1;  b = 0,1,2    (7),
In formula (7), b is the number of the plane, with b = 0 corresponding to the XY plane, b = 1 to the XT plane and b = 2 to the YT plane, n_b is the number of binary patterns produced by the LBP operator on the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the number of pixel points whose LBP coding value is a when the LBP_{Z,R} operator performs feature extraction on the b-th plane;
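The following sketch illustrates the LBP coding of equations (5)-(6) and the histogram of equation (7) in a deliberately simplified form: instead of accumulating codes for every slice of the three orthogonal planes, it encodes only the central XY, XT and YT slices of the sequence, which keeps the example short while preserving the structure of the descriptor.

```python
import numpy as np

def lbp_8_1(img):
    """Basic 8-neighbour, radius-1 LBP codes for one 2-D slice (eqs. (5)-(6))."""
    c = img[1:-1, 1:-1]
    neigh = [img[:-2, :-2], img[:-2, 1:-1], img[:-2, 2:], img[1:-1, 2:],
             img[2:, 2:], img[2:, 1:-1], img[2:, :-2], img[1:-1, :-2]]
    codes = np.zeros_like(c, dtype=np.int32)
    for q, n in enumerate(neigh):
        codes += (n >= c).astype(np.int32) << q      # Sig(t_q - t_c) * 2^q
    return codes

def lbp_top_histogram(volume):
    """volume: (T, H, W) grayscale frame sequence. Simplified LBP-TOP: LBP
    histograms of the central XY, XT and YT slices, concatenated (eq. (7))."""
    T, H, W = volume.shape
    planes = [volume[T // 2],            # XY plane
              volume[:, H // 2, :],      # XT plane
              volume[:, :, W // 2]]      # YT plane
    hists = [np.bincount(lbp_8_1(p).ravel(), minlength=256) for p in planes]
    hists = [h / max(h.sum(), 1) for h in hists]     # per-plane normalisation
    return np.concatenate(hists)                     # 3 x 256 texture feature
```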
fifthly, extracting geometric features of the human face:
According to the key frame of the screened image frame sequence obtained in the third step, the geometric features of the facial expression are calculated from the coordinates of the T feature points marked in that key frame; in the field of facial expression recognition, the region with the richest facial features is the T-shaped region of the face, mainly comprising the eyebrow, eye, nose, chin and mouth regions, so the extraction method of the facial geometric features mainly extracts distance features between the marked points of the facial T-shaped region;
and 5.1, calculating the Euclidean distance characteristics of the face characteristic point pairs:
From the T feature points of the key frame in the screened image frame sequence obtained in the third step, select 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points, 50 pairs of feature points in total; calculate the Euclidean distance between each feature point pair A and B, giving a 50-dimensional Euclidean distance feature denoted G_50. Equation (8) for calculating the Euclidean distance between a feature point pair A and B is as follows,
d(A, B) = sqrt( (p_{A,x} − p_{B,x})² + (p_{A,y} − p_{B,y})² )    (8),
In formula (8), p_A is the coordinate set of feature point A, p_B is the coordinate set of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
and 5.2, calculating the angle characteristics of the face characteristic points:
From the T feature points of the key frame obtained by the third-step screening, select 10 angles representing facial feature changes, namely 2 eyebrow angles, 2 eye angles and 6 mouth angles, and extract the angle features, giving a 10-dimensional angle feature denoted Q_10; equation (9) for calculating the angle at a feature point is as follows,
θ = arccos( ( (p_C − p_D) · (p_E − p_D) ) / ( ||p_C − p_D|| · ||p_E − p_D|| ) )    (9),
In formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points forming an angle in the eyebrow, eye or mouth region among the face feature points marked in the third step, where p_D is the vertex coordinate set;
and 5.3, calculating the area characteristics of the face region:
Select 5 regions of the face image, namely the left and right eyebrows, the two eyes and the mouth, and calculate the area feature of each of the 5 regions; because the sizes of the facial organs differ from person to person, the areas of the 5 face regions extracted from the key frame and the areas of the same 5 regions extracted from the neutral frame are correspondingly subtracted to obtain the change features of the face region areas, 5 dimensions in total, denoted O_5; the eyebrow, mouth and eye regions of the face are modeled as triangles, and the area of each triangle is calculated with Heron's formula; the Euclidean distance feature G_50 of the face feature point pairs, the angle feature Q_10 of the face feature points and the face region area feature O_5 are combined into the geometric feature F of the face as shown in formula (10),
F = [G_50  Q_10  O_5]    (10),
at this point, the facial texture features and the facial geometric features are connected in series to complete the extraction of the facial expression features;
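A compact sketch of the fifth-step geometric feature F = [G_50, Q_10, O_5] is shown below. The concrete landmark index lists (which 50 pairs, which 10 angle triplets, which 5 triangular regions) are not fixed here and are passed in as arguments, and the key-frame-minus-neutral-frame direction of the area difference is an assumption of this sketch.

```python
import numpy as np

def euclid(pA, pB):
    """Equation (8): Euclidean distance between two landmark coordinates."""
    return float(np.hypot(pA[0] - pB[0], pA[1] - pB[1]))

def angle_at(pC, pD, pE):
    """Equation (9): angle at vertex p_D formed by landmarks p_C, p_D, p_E."""
    v1 = np.asarray(pC) - np.asarray(pD)
    v2 = np.asarray(pE) - np.asarray(pD)
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cosang, -1.0, 1.0)))

def triangle_area(pa, pb, pc):
    """Heron's formula for the triangular brow / eye / mouth regions."""
    a, b, c = euclid(pb, pc), euclid(pa, pc), euclid(pa, pb)
    s = (a + b + c) / 2.0
    return float(np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0)))

def geometric_feature(key_pts, neutral_pts, dist_pairs, angle_triplets, area_triplets):
    """F = [G_50, Q_10, O_5] per equation (10); the index lists are assumptions,
    since the claim fixes their sizes (50, 10, 5) but not the exact landmark IDs."""
    G = [euclid(key_pts[i], key_pts[j]) for i, j in dist_pairs]
    Q = [angle_at(key_pts[c], key_pts[d], key_pts[e]) for c, d, e in angle_triplets]
    O = [triangle_area(*[key_pts[k] for k in t]) -
         triangle_area(*[neutral_pts[k] for k in t]) for t in area_triplets]
    return np.concatenate([G, Q, O])
```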
sixthly, grouping the facial expressions:
the six emotions of the human face are: surprise, fear, anger, aversion, happiness and sadness, which are divided into three groups in pairs, wherein the groups are as follows:
a first group: surprise and fear; second group: angry and aversion; third group: happy and sad;
seventhly, classifying the facial expression for the first time:
putting the facial expression features extracted in the fourth step and the fifth step into an ELM classifier for training and testing, thereby finishing the first classification of facial expression recognition and obtaining the recognition result of the first classification of the facial expression recognition, wherein the parameters of the ELM are set as: ELM type: "classification", number of hidden layer neurons: "20", activation function: a "Sigmoid" function;
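Scikit-learn does not ship an extreme learning machine, so the first-stage classifier can be sketched directly: a random hidden layer of 20 sigmoid neurons whose output weights are solved in closed form by a least-squares pseudo-inverse, which is the standard ELM formulation. The random seed and the one-hot target encoding are implementation choices of this sketch rather than parameters fixed by the claim.

```python
import numpy as np

class SimpleELM:
    """Minimal extreme learning machine: random hidden layer, sigmoid activation,
    output weights solved by least squares; 20 hidden neurons as in the claim."""
    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid activation

    def fit(self, X, y):
        self.classes_, y_idx = np.unique(y, return_inverse=True)
        T = np.eye(len(self.classes_))[y_idx]                  # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        self.beta = np.linalg.pinv(self._hidden(X)) @ T        # output weights
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]
```

A fitted SimpleELM provides the first-stage facial expression decision, which in turn determines the group-specific prosodic features extracted in the eighth step.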
and B, taking the speech emotion recognition as a second classification recognition:
in the process B, based on the facial expression recognition result in the process a, the speech features are combined, and the speech emotion feature extraction and the second classification of the speech emotion recognition are performed on each of the three groups in the sixth facial expression grouping, specifically, the operations are as follows:
and eighth, extracting the speech emotion characteristics:
and aiming at the classification result of the first classification of the facial expression recognition in the seventh step, according to the grouping in the sixth step, different prosodic features are respectively extracted according to the different sensitivity degrees of the emotions of each group to different audio prosodic features:
a first group: extracting zero crossing rate ZCR and logarithmic energy LogE,
second group: extracting a Teager energy operator TEO, a zero-crossing rate ZCR and a logarithmic energy LogE,
third group: extracting Pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO,
the Pitch height Pitch of the prosodic feature described above is calculated in the frequency domain,
For the speech signal M preprocessed in the above-described second step, the pitch Pitch is calculated by the following equation (11),
Pitch = DFT( x(m)·ω(m), L_M )    (11),
In equation (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M is the length of the speech signal, and x(m)·ω(m) is the speech signal multiplied by a Hamming window ω(m), whose calculation is shown in the following equation (12),
ω(m) = 0.54 − 0.46·cos( 2πm / (N − 1) ),  0 ≤ m ≤ N − 1    (12),
In equation (12), N is the number of Hamming windows and m denotes the m-th Hamming window;
The zero-crossing rate ZCR in the prosodic features is calculated as shown in equation (13),
ZCR = (1 / (2N)) · Σ_{m=1}^{N} | sgn{x(m)} − sgn{x(m−1)} |    (13),
In formula (13), ZCR represents the average zero-crossing rate of the N windows, | | is the absolute value sign, x(m) is the speech signal of the m-th window after framing and windowing, and the sgn{x(m)} function judges the sign of the speech amplitude; the sgn{x(m)} function is calculated by formula (14),
sgn{x(m)} = 1, if x(m) ≥ 0;  sgn{x(m)} = −1, if x(m) < 0    (14),
In formula (14), x(m) is the speech signal of the m-th window after framing and windowing;
The above log energy LogE is calculated as in equation (15),
LogE = log( Σ_{m=1}^{N} x²(m) )    (15),
In formula (15), LogE represents the total logarithmic energy of the N windows, x(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in equation (16),
ψ[X(m)] = [X′(m)]² − X(m)·X″(m)    (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm, and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(Φm + θ), where a is the signal amplitude, Φ is the signal frequency and θ is the initial phase angle of the signal;
For the audio files corresponding to the image frames of each of the three groups in the sixth-step grouping of facial expressions, the well-known Mel-frequency cepstral coefficients MFCC and their first-order and second-order difference features are also extracted, and finally the prosodic features extracted for each group are concatenated in series with the corresponding MFCC, first-order difference and second-order difference features to form the mixed audio features,
thereby completing the extraction of the speech emotion characteristics;
and ninthly, classifying for the second time of the speech emotion recognition:
Putting the speech emotion features extracted in the eighth step into an SVM for training and testing finally yields the recognition rate of speech emotion recognition, wherein the parameters of the SVM are set as: penalty coefficient: "95"; allowed redundant output: "0"; kernel parameter: "1"; kernel function of the support vector machine: "Gaussian kernel",
thereby completing the second classification of the speech emotion recognition;
and C, fusing facial expression recognition and voice emotion recognition:
and step ten, fusing the facial expression recognition and the voice emotion recognition on a decision level:
Since the speech emotion recognition in process B is a secondary recognition performed on the basis of the facial emotion recognition in process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first-stage facial image recognition, and P(Audio|Visual) is the recognition rate of the second-stage speech emotion recognition;
and finishing the progressive video emotion recognition based on two decision-level processes, namely facial expression recognition and voice emotion recognition fusion.
CN201811272233.1A 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition Active CN109409296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811272233.1A CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811272233.1A CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Publications (2)

Publication Number Publication Date
CN109409296A CN109409296A (en) 2019-03-01
CN109409296B true CN109409296B (en) 2020-12-01

Family

ID=65470610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811272233.1A Active CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Country Status (1)

Country Link
CN (1) CN109409296B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961054A (en) * 2019-03-29 2019-07-02 山东大学 It is a kind of based on area-of-interest characteristic point movement anxiety, depression, angry facial expression recognition methods
CN110363074B (en) * 2019-06-03 2021-03-30 华南理工大学 Humanoid recognition interaction method for complex abstract events
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN110443143B (en) * 2019-07-09 2020-12-18 武汉科技大学 Multi-branch convolutional neural network fused remote sensing image scene classification method
CN112308102B (en) * 2019-08-01 2022-05-17 北京易真学思教育科技有限公司 Image similarity calculation method, calculation device, and storage medium
CN110677598B (en) * 2019-09-18 2022-04-12 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
CN111144197A (en) * 2019-11-08 2020-05-12 宇龙计算机通信科技(深圳)有限公司 Human identification method, device, storage medium and electronic equipment
CN111178389B (en) * 2019-12-06 2022-02-11 杭州电子科技大学 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling
CN111553311A (en) * 2020-05-13 2020-08-18 吉林工程技术师范学院 Micro-expression recognition robot and control method thereof
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN112101462B (en) * 2020-09-16 2022-04-19 北京邮电大学 Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN
CN112418095B (en) * 2020-11-24 2023-06-30 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112488219A (en) * 2020-12-07 2021-03-12 江苏科技大学 Mood consolation method and system based on GRU and mobile terminal
CN112766112B (en) * 2021-01-08 2023-01-17 山东大学 Dynamic expression recognition method and system based on space-time multi-feature fusion
CN114005153A (en) * 2021-02-01 2022-02-01 南京云思创智信息科技有限公司 Real-time personalized micro-expression recognition method for face diversity
CN112949560B (en) * 2021-03-24 2022-05-24 四川大学华西医院 Method for identifying continuous expression change of long video expression interval under two-channel feature fusion
CN113065449B (en) * 2021-03-29 2022-08-19 济南大学 Face image acquisition method and device, computer equipment and storage medium
CN113076847B (en) * 2021-03-29 2022-06-17 济南大学 Multi-mode emotion recognition method and system
CN113111789B (en) * 2021-04-15 2022-12-20 山东大学 Facial expression recognition method and system based on video stream
CN113128399B (en) * 2021-04-19 2022-05-17 重庆大学 Speech image key frame extraction method for emotion recognition
CN117577140B (en) * 2024-01-16 2024-03-19 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731416A (en) * 2005-08-04 2006-02-08 上海交通大学 Method of quick and accurate human face feature point positioning
CN105139004A (en) * 2015-09-23 2015-12-09 河北工业大学 Face expression identification method based on video sequences
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN107704810A (en) * 2017-09-14 2018-02-16 南京理工大学 A kind of expression recognition method suitable for medical treatment and nursing
CN108682431A (en) * 2018-05-09 2018-10-19 武汉理工大学 A kind of speech-emotion recognition method in PAD three-dimensionals emotional space

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A new approach of audio emotion recognition";Chien Shing Ooi.et al;《ELSEVIER》;20140324;期刊第5858-5869页 *
"基于LBP-TOP特征的微表情识别";卢官明等;《南京邮电大学学报(自然科学版)》;20171231;第37卷(第6期);第2章2.7-2.8节 *
Emotion recognition using facial and audio features;Tarun Krishna.et al;《ICMI "13: Proceedings of the 15th ACM on International conference on multimodal interaction》;20131231;全文 *

Also Published As

Publication number Publication date
CN109409296A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109409296B (en) Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN112784798B (en) Multi-modal emotion recognition method based on feature-time attention mechanism
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
Chen et al. Extracting speaker-specific information with a regularized siamese deep network
CN108269133A (en) A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition
Liu et al. Re-synchronization using the hand preceding model for multi-modal fusion in automatic continuous cued speech recognition
CN101187990A (en) A session robotic system
Datcu et al. Emotion recognition using bimodal data fusion
CN102930298A (en) Audio visual emotion recognition method based on multi-layer boosted HMM
CN111583964A (en) Natural speech emotion recognition method based on multi-mode deep feature learning
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
CN108256307A (en) A kind of mixing enhancing intelligent cognition method of intelligent business Sojourn house car
CN110211594A (en) A kind of method for distinguishing speek person based on twin network model and KNN algorithm
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
Jaratrotkamjorn et al. Bimodal emotion recognition using deep belief network
Lim et al. Emotion Recognition by Facial Expression and Voice: Review and Analysis
Huang et al. Speech emotion recognition using convolutional neural network with audio word-based embedding
CN107437090A (en) The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal
Veni et al. Feature fusion in multimodal emotion recognition system for enhancement of human-machine interaction
Atkar et al. Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant