CN109409296A - Video emotion recognition method fusing facial expression recognition and speech emotion recognition - Google Patents
Video emotion recognition method fusing facial expression recognition and speech emotion recognition
- Publication number
- CN109409296A (application CN201811272233.1A)
- Authority
- CN
- China
- Prior art keywords
- formula
- recognition
- feature
- point
- mentioned
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/172—Classification, e.g. identification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/57—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The present invention is a video emotion recognition method that fuses facial expression recognition and speech emotion recognition. It relates to the processing of record carriers for recognizing figures, and is an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The method separates the facial expression recognition and the speech emotion recognition in a video and recognizes emotion in two progressive processes: by computing conditional probabilities, speech emotion recognition is carried out on the basis of the facial expression recognition result. The steps are: Process A, facial-image expression recognition as the first classification; Process B, speech emotion recognition as the second classification; Process C, fusion of facial expression recognition and speech emotion recognition. The invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and achieves low recognition rates.
Description
Technical field
The technical solution of the present invention relates to the processing of record carriers for recognizing figures, specifically to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background technique
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology advances daily, and computer-based human emotion recognition has received widespread attention. How to make computers recognize human emotions faster and more accurately has become a current research hotspot in the field of machine vision.
Humans express emotion in diverse ways, mainly through facial expressions, emotional speech, upper-body posture and language. Among these, facial expression and emotional speech are the two most typical modes. Because the texture and geometric features of the face are relatively easy to extract, emotion recognition methods based on facial expression have already reached fairly high recognition rates. However, for similar expressions, such as anger and disgust, or fear and surprise, the texture and geometric features are close to each other, and methods that rely only on extracted facial-expression features achieve low recognition rates.
Single-modality emotion recognition methods therefore have inherent limitations, and bimodal or multimodal emotion recognition has increasingly become a hotspot of research and attention in the field of emotion recognition. The key to multimodal emotion recognition is how the modalities are fused; the mainstream fusion modes are feature-level fusion and decision-level fusion.
In 2012, in the paper "AVEC: the continuous audio/visual emotion challenge", Schuller et al. cascaded audio and video features into a single feature vector and used support vector regression (SVR) as the baseline of the AVEC 2012 challenge; this feature-level fusion method directly concatenates the multimodal features into a joint feature vector. Because the number of multimodal features is huge, this easily causes the curse of dimensionality, and high-dimensional features are highly susceptible to data sparsity. Considering the interaction between features, the advantage of combining audio and video features under feature-level fusion is therefore limited.
Decision-level fusion means that each emotional expression mode is first modeled by its own classifier, and the recognition results of the individual classifiers are then fused; without increasing the dimensionality, the different modes are combined through the contribution of each emotional expression. In the paper "A combined rule-based & machine learning audio-visual emotion recognition approach", Seng et al. split audiovisual emotion recognition into two mutually independent paths that extract features separately, model each on its own classifier, obtain the corresponding recognition rates, and finally derive the overall recognition rate from ratio scoring and the corresponding weight distribution. Existing decision-level fusion has two main shortcomings. First, ratio scoring and weight-distribution strategies lack a unified authoritative standard: different researchers, using a variety of ratio scores and weight-distribution strategies, often obtain different recognition results on the same research project. Second, decision-level fusion emphasizes the fusion of face recognition and speech recognition results while ignoring the inner link between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method based on compound spatio-temporal features. It extends the existing local binary pattern algorithm to spatio-temporal local ternary patterns, obtains spatio-temporal local ternary pattern texture features of facial expression and upper-body posture, and further fuses three-dimensional histogram-of-oriented-gradients features to strengthen the description of the emotion video, combining the two kinds of features into compound spatio-temporal features. When the upper-body posture in the video changes quickly or upper-body frames are missing, the algorithm is affected, so this bimodal method combining facial expression and upper-body posture has limitations in feature extraction.
CN105512609A discloses a multimodal fusion video emotion recognition method based on a kernel extreme learning machine. Feature extraction and feature selection are performed on the image and audio information of the video to obtain video features; the acquired multichannel EEG signals are preprocessed and subjected to feature extraction and feature selection to obtain EEG features; a multimodal fusion video emotion recognition model based on a kernel extreme learning machine is established; and the video features and EEG features are input into the model for video emotion recognition, yielding the final classification accuracy. However, the algorithm achieves a high recognition rate only for three classes of video emotion data, and its usability is therefore limited.
CN103400145B discloses a cue-neural-network-based audio-visual fusion emotion recognition method. The method first distinguishes three channels of user data: frontal facial expression, profile facial expression and speech, and independently trains a neural network for each to recognize discrete emotion categories. During training, 4 cue nodes are added to the output layer of each neural network model, carrying the cue information of 4 coarse-grained categories in the activation-evaluation space; a multimodal fusion model, itself a neural network trained with cue information, then fuses the outputs of the three networks. However, since most videos contain few profile-face frames, effective acquisition is difficult, which greatly limits the method in practice. The method also involves training and fusing neural networks; as data volume and data dimensionality grow, the consumption of training time and resources increases gradually, and the error rate rises as well.
CN105138991B discloses a video emotion recognition method based on the fusion of emotionally salient features. The method extracts audio features and visual emotion features from each video shot in the training set; the audio features are built into an emotion distribution histogram feature with a bag-of-words model, and the visual emotion features are built into an emotion attention feature with a visual dictionary. The emotion attention feature and the emotion distribution histogram feature are fused top-down to form emotionally salient video features. When extracting visual emotion features, this method uses only video key frames and to some extent ignores the associations between features of consecutive frames.
Summary of the invention
The technical problem to be solved by the present invention is to provide a video emotion recognition method fusing facial expression recognition and speech emotion recognition: an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The method separates the facial expression recognition and the speech emotion recognition in the video and recognizes video emotion in two progressive processes; by computing conditional probabilities, speech emotion recognition is carried out on the basis of the facial expression recognition result. The invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and achieves low recognition rates.
The technical solution adopted by the present invention to solve this problem is a video emotion recognition method fusing facial expression recognition and speech emotion recognition, an audiovisual emotion recognition method in which two decision-level processes proceed progressively. The specific steps are as follows:
Process A. Facial-image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions, and the first classification of facial expression recognition, with the following steps:
First step: frame extraction from the video signal and extraction of the speech signal:
The videos in the database are decomposed into image frame sequences, frame extraction being performed with the freely available FormatFactory software, and the speech signal in each video is extracted and saved in MP3 format;
Second step: preprocessing of the image frame sequence and the speech signal:
The image frame sequence obtained in the first step is processed with the published Viola&Jones algorithm to locate and crop the face, and the cropped face images are normalized to M × M pixels, giving a size-normalized face image frame sequence.
The speech signal obtained in the first step is processed with the well-known voice activity detection (VAD) algorithm to detect speech and remove noise and silent segments, giving a speech signal from which features are easier to extract.
This completes the preprocessing of the image frame sequence and the speech signal;
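The patent invokes a standard VAD algorithm without specifying it. As a rough illustration only (not the patent's algorithm), a minimal energy-threshold VAD can drop silent frames; the frame length and threshold ratio below are arbitrary assumptions:

```python
import numpy as np

def simple_vad(signal, frame_len=400, energy_ratio=0.1):
    """Keep only frames whose short-time energy exceeds a fraction
    of the mean frame energy (a crude stand-in for a real VAD)."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    threshold = energy_ratio * energies.mean()
    voiced = frames[energies > threshold]
    return voiced.reshape(-1)

# Silence followed by a sine burst: the silent half should be removed.
t = np.arange(4000)
sig = np.concatenate([np.zeros(4000), np.sin(0.1 * t)])
out = simple_vad(sig)
```

A real system would add overlap, hangover smoothing and a noise-floor estimate; this only shows the energy-gating idea.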
Third step: labeling facial feature points on the image frame sequence and screening the key frame:
The size-normalized face image frame sequence from the second step is labeled with T facial feature points, T ranging over 1, 2, ..., 68; the positions of the 68 feature points are well known, and the labeled feature points outline the eye, eyebrow, nose and mouth regions of the face image. From the coordinates of the T feature points, the following 6 specific distances are computed for the u-th frame of the size-normalized face image frame sequence of the second step:
vertical distance between eye and eyebrow, D_{u,1} = d_vertical||p_22, p_40||,
vertical eye-opening distance, D_{u,2} = d_vertical||p_45, p_47||,
vertical distance between eye and mouth, D_{u,3} = d_vertical||p_37, p_49||,
vertical distance between nose and mouth, D_{u,4} = d_vertical||p_34, p_52||,
vertical distance between upper and lower lip, D_{u,5} = d_vertical||p_52, p_58||,
horizontal width of the mouth, D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} - p_{i,y}|, d_horizontal||p_i, p_j|| = |p_{j,x} - p_{i,x}| (1),
In formula (1), p_i is the coordinate set of the i-th feature point and p_j that of the j-th feature point; p_{i,y} and p_{j,y} are their ordinates and p_{i,x} and p_{j,x} their abscissas; d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, and d_horizontal||p_i, p_j|| the horizontal distance, with i = 1, 2, ..., 68 and j = 1, 2, ..., 68;
Taking the first frame of the size-normalized face image frame sequence of the second step as the neutral frame, the set V_0 of its 6 specific distances is given by formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}] (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame of the size-normalized face image frame sequence of the second step;
The set V_u of the 6 specific distances of the u-th frame of the size-normalized face image frame sequence of the second step is given by formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}] (3),
In formula (3), u = 1, 2, ..., K-1, where K is the number of face images in one size-normalized face image frame sequence of the second step, and D_{u,1}, ..., D_{u,6} are the 6 specific distances of the u-th frame;
The sum of the ratios of the 6 corresponding specific distances of the neutral frame and the u-th frame of the size-normalized face image frame sequence of the second step is given by formula (4),
DF_u = Σ_{n=1}^{6} D_{0,n} / D_{u,n} (4),
In formula (4), DF_u is the sum of the ratios of the 6 specific distances of the neutral frame to the corresponding distances of the u-th frame, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame;
In the size-normalized face image frame sequence of the second step, the distance ratio DF of every frame is obtained according to formula (2), formula (3) and formula (4), and the frame with the largest DF is screened out as the key frame of the image frame sequence.
This completes the labeling of facial feature points on the image frame sequence and the screening of the key frame of the image frame sequence;
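The key-frame screening above can be sketched as follows, assuming the reconstruction of formula (4) as the sum of neutral-to-current distance ratios; the per-frame landmark distances are toy values invented for illustration:

```python
import numpy as np

def select_key_frame(distances):
    """distances: (K, 6) array of the 6 specific distances per frame;
    row 0 is the neutral frame.  DF_u = sum_n D_{0,n} / D_{u,n}
    (formula (4) as reconstructed); the key frame maximizes DF."""
    distances = np.asarray(distances, dtype=float)
    neutral = distances[0]
    df = (neutral / distances[1:]).sum(axis=1)   # DF for frames 1..K-1
    return int(np.argmax(df)) + 1                # index of the key frame

# Toy sequence: frame 2 deviates most from the neutral distances,
# so it should be screened out as the key frame.
seq = [[10, 10, 10, 10, 10, 10],
       [10, 10, 10, 10,  9, 10],
       [ 5,  5, 10, 10,  5, 10],
       [10,  9, 10, 10, 10, 10]]
key = select_key_frame(seq)
```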
Fourth step: extraction of facial texture features:
Facial texture features are extracted with the LBP-TOP algorithm. First, the size-normalized face image frame sequence of the second step is divided in space-time into the three orthogonal planes XY, XT and YT. In each orthogonal plane, the LBP value of the central pixel of each 3 × 3 neighborhood is computed, the LBP histogram of each of the three orthogonal planes is accumulated, and finally the three LBP histograms are concatenated into an overall feature vector. The LBP operator is computed by formula (5) and formula (6),
LBP_{Z,R} = Σ_{q=0}^{Z-1} Sig(t_q - t_c) · 2^q (5),
Sig(t_q - t_c) = 1 if t_q - t_c ≥ 0, and 0 otherwise (6),
In formula (5) and formula (6), Z is the number of neighborhood points around the central pixel, R is the distance from the neighborhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q - t_c) is the LBP code of the q-th neighborhood point.
The LBP-TOP histogram is defined by formula (7),
H_{a,b} = Σ_{x,y,t} I{LBP_{Z,R,b}(x, y, t) = a}, a = 0, ..., n_b - 1, b = 0, 1, 2 (7),
In formula (7), b is the index of the plane: b = 0 is the XY plane, b = 1 the XT plane and b = 2 the YT plane; n_b is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP code equals a when features are extracted in the b-th plane with the LBP_{Z,R} operator;
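The basic LBP operator of formulas (5) and (6) can be illustrated on a single 3 × 3 neighborhood (Z = 8, R = 1); the clockwise neighbour ordering from the top-left is an assumption, since the patent does not fix it:

```python
import numpy as np

def lbp_code(patch):
    """LBP value of the centre pixel of a 3x3 patch:
    LBP = sum_q Sig(t_q - t_c) * 2^q, with Sig(x) = 1 if x >= 0 else 0
    (formulas (5) and (6)); neighbours taken clockwise from top-left."""
    t_c = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    return sum((1 if t_q >= t_c else 0) << q
               for q, t_q in enumerate(neighbours))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)   # an 8-bit pattern in [0, 255]
```

LBP-TOP simply accumulates these codes into a histogram per orthogonal plane (XY, XT, YT) and concatenates the three histograms.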
Fifth step: extraction of facial geometric features:
From the key frame screened in the third step, the coordinates of the T feature points labeled in the key frame are computed to obtain the geometric features of the facial expression. In facial expression recognition, the richest facial feature region is the T-shaped region of the face, mainly comprising the eyebrows, eyes, nose, chin and mouth; the geometric feature extraction therefore mainly extracts distance features between the labeled points of this T-shaped region;
Step 5.1: computing the Euclidean distance features of facial feature-point pairs:
From the T feature points of the key frame screened in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 pairs in total, and the Euclidean distance between each feature-point pair A and B is computed, giving a 50-dimensional Euclidean distance feature, denoted G50. The Euclidean distance between feature points A and B is computed by formula (8),
d(A, B) = sqrt((p_{A,x} - p_{B,x})^2 + (p_{A,y} - p_{B,y})^2) (8),
In formula (8), p_A and p_B are the coordinate sets of feature points A and B; p_{A,x} and p_{A,y} are the abscissa and ordinate of A, and p_{B,x} and p_{B,y} those of B;
Step 5.2: computing the angle features of facial feature points:
From the T feature points of the key frame screened in the third step, 10 angles characterizing the changes of the facial features are chosen: 2 eyebrow angles, 6 eye angles and 2 mouth angles are computed, giving a 10-dimensional angle feature, denoted Q10. Each feature-point angle is computed by formula (9),
θ = arccos(((p_C - p_D) · (p_E - p_D)) / (||p_C - p_D|| · ||p_E - p_D||)) (9),
In formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points forming an angle in the eyebrow, eye or mouth region labeled in the third step, with p_D the coordinate set of the vertex;
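A minimal sketch of the distance and angle features of steps 5.1 and 5.2, with formula (9) read as the usual three-point angle at the vertex p_D; the landmark coordinates are toy values:

```python
import numpy as np

def euclidean_feature(p_a, p_b):
    """Formula (8): Euclidean distance between two landmarks."""
    return float(np.hypot(p_a[0] - p_b[0], p_a[1] - p_b[1]))

def angle_feature(p_c, p_d, p_e):
    """Formula (9) as reconstructed: angle at vertex p_d formed by
    the landmarks p_c and p_e, in degrees."""
    v1 = np.asarray(p_c, float) - np.asarray(p_d, float)
    v2 = np.asarray(p_e, float) - np.asarray(p_d, float)
    cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))

d = euclidean_feature((0, 0), (3, 4))       # a 3-4-5 triangle side
a = angle_feature((1, 0), (0, 0), (0, 1))   # right angle at the vertex
```

Stacking 50 such distances and 10 such angles yields the G50 and Q10 vectors of the text.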
Step 5.3: computing facial region area features:
Five regions of the face image are selected: the left and right eyebrows, the two eyes, and the mouth, and the area of each of the 5 regions is computed. Because face sizes differ from person to person, the areas of the 5 facial regions extracted from the key frame are subtracted from the corresponding areas extracted from the neutral frame, giving the change features of the facial region areas, 5 dimensions in total, denoted O5. The eyebrow, mouth and eye regions of the face are modeled as triangles, and each triangle area is computed with Heron's formula. The Euclidean distance features G50 of the feature-point pairs, the angle features Q10 of the feature points, and the region area features O5 are combined into the facial geometric feature F, as in formula (10),
F = [G50 Q10 O5] (10),
So far, the facial texture features and the facial geometric features are concatenated, completing the extraction of the facial expression features;
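The triangle-area computation with Heron's formula in step 5.3 can be sketched as follows; the three landmark coordinates per region are placeholders:

```python
import math

def heron_area(p1, p2, p3):
    """Area of the triangle spanned by three landmarks (Heron's formula),
    used here for the eyebrow, eye and mouth regions."""
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p3, p1)
    s = (a + b + c) / 2
    # max(...) guards against tiny negative values from rounding.
    return math.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0))

# Region-area change feature: key-frame area minus neutral-frame area.
neutral = heron_area((0, 0), (4, 0), (0, 3))
key = heron_area((0, 0), (4, 0), (0, 6))
change = key - neutral
```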
Sixth step: grouping of facial expressions:
The six facial emotions: surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups, as follows:
First group: surprise, fear; second group: anger, disgust; third group: happiness, sadness;
Seventh step: first classification of facial expression recognition:
The facial expression features extracted in the fourth and fifth steps are fed into an ELM classifier for training and testing, completing the first classification of facial expression recognition and giving its recognition result. The ELM parameters are: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: the "Sigmoid" function;
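The ELM classifier itself is not spelled out in the patent. A minimal textbook ELM with 20 sigmoid hidden neurons (matching the stated parameters) would look roughly like this; the toy two-cluster "expression feature" data is invented for illustration:

```python
import numpy as np

def train_elm(X, y, hidden=20, seed=0):
    """Minimal ELM: random fixed input weights, sigmoid hidden layer,
    output weights solved in closed form by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))   # random input weights
    b = rng.normal(size=hidden)                 # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))      # Sigmoid activations
    beta = np.linalg.pinv(H) @ y                # output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)            # predicted class index

# Two well-separated toy clusters with one-hot labels.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 5)), rng.normal(3, 0.1, (20, 5))])
y = np.vstack([np.tile([1, 0], (20, 1)), np.tile([0, 1], (20, 1))])
W, b, beta = train_elm(X, y)
pred = predict_elm(X, W, b, beta)
```

The closed-form output-weight solution is what makes ELM training fast compared with backpropagation.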
Process B. Speech emotion recognition as the second classification:
On the basis of the facial expression recognition result of Process A, and combining the speech features, Process B performs speech emotion feature extraction and the second classification by speech emotion recognition for each of the three groups of the facial expression grouping of the sixth step. The concrete operations are as follows:
8th step, the extraction of speech emotional feature:
For the classification results of the first subseries of above-mentioned 7th step facial expression recognition, according to the grouping of the 6th step, often
One group of emotion extracts different prosodic features to the difference of the sensitivity of different audio prosodic features respectively:
First group: zero-crossing rate ZCR and logarithmic energy LogE is extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR, logarithmic energy LogE are extracted,
Third group: extracting pitch Pitch, zero-crossing rate ZCR, Teager energy operator TEO,
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain,
For the voice signal M preprocessed in the above second step, the pitch Pitch is calculated with the following formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the voice signal, and the windowed term represents the voice signal with a Hamming window applied, whose calculation is shown in the following formula (12),
In formula (12), N is the number of Hamming windows, and m is the m-th Hamming window;
The calculation of the zero-crossing rate ZCR among the above prosodic features is shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, X(m) is the voice signal of the m-th window after framing and windowing, and the sgn{X(m)} function judges whether the speech amplitude is positive or negative; sgn{X(m)} is calculated by formula (14),
In formula (14), X(m) is the voice signal of the m-th window after framing and windowing;
The calculation formula (15) of the above logarithmic energy LogE is as follows,
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the voice signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency, and θ is the initial phase angle of the signal,
For each group of the three groups in the above 6th-step facial expression grouping, the well-known mel-frequency cepstrum coefficients MFCC, together with their first-order difference features and second-order difference features, are extracted from the audio file corresponding to the group's image frames; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of speech emotional features is completed;
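For a single windowed frame, the three prosodic quantities above can be sketched in plain Python as follows; the function names are ours, the discrete TEO uses the standard form x(m)² − x(m−1)·x(m+1), and the exact normalizations of formulas (13), (15) and (16) may differ from the patent's:

```python
import math

def zero_crossing_rate(x):
    """Average zero-crossing rate of one windowed frame (cf. formula (13))."""
    sgn = lambda v: 1 if v >= 0 else -1          # cf. formula (14)
    n = len(x)
    return sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(1, n)) / (2 * n)

def log_energy(x, eps=1e-10):
    """Logarithmic energy of one frame (cf. formula (15)); eps avoids log(0)."""
    return math.log10(eps + sum(v * v for v in x))

def teager_energy(x):
    """Discrete Teager energy operator psi[x(m)] = x(m)^2 - x(m-1)*x(m+1),
    the standard discrete counterpart of formula (16), averaged over the frame."""
    vals = [x[m] ** 2 - x[m - 1] * x[m + 1] for m in range(1, len(x) - 1)]
    return sum(vals) / len(vals)

# toy frame: a pure cosine, i.e. a signal of constant amplitude and frequency
frame = [math.cos(0.3 * m) for m in range(200)]
features = [zero_crossing_rate(frame), log_energy(frame), teager_energy(frame)]
```

For the constant-amplitude cosine, the discrete TEO comes out as the constant sin²(φ), matching the operator's known behavior on X(m) = a·cos(φm + θ).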
9th step, the second classification of speech emotion recognition:
The speech emotional features extracted in the above 8th step are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since the speech emotion recognition of the above process B is a secondary recognition carried out on the basis of the facial emotion recognition of the above process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first facial image recognition, and P(Audio|Visual) is the recognition rate of the second speech emotion recognition;
So far, the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
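Formula (17) reduces the decision-level fusion to a product of the stage-wise rates; a trivial sketch with hypothetical recognition rates (the function name and numbers are ours):

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Decision-level fusion of formula (17):
    P(Audio_Visual) = P(Visual) * P(Audio | Visual)."""
    assert 0.0 <= p_visual <= 1.0 and 0.0 <= p_audio_given_visual <= 1.0
    return p_visual * p_audio_given_visual

# hypothetical rates: 90% for the first (visual) stage and 85% for the
# second (speech) stage conditioned on the visual result
p = fused_recognition_rate(0.90, 0.85)  # -> 0.765
```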
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the third step uses the coordinates of T feature points, where T = 68.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the English name of the voice endpoint detection algorithm is Voice Activity Detection, abbreviated VAD; the English name of the zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the English name of the logarithmic energy is LogEnergy, abbreviated LogE; the English name of the mel-frequency cepstrum coefficient is Mel-frequency cepstral coefficients, abbreviated MFCC; the English name of the Teager energy operator is Teager Energy Operator, abbreviated TEO. The voice activity detection algorithm, zero-crossing rate, logarithmic energy, mel-frequency cepstrum coefficient and Teager energy operator used here are all well known to the art.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the related calculating operation methods are ones that those skilled in the art will appreciate.
The beneficial effects of the present invention are: compared with the prior art, the outstanding substantive features and marked improvements of the present invention are as follows:
(1) The present invention provides a video emotion recognition method fusing facial expression recognition and speech emotion recognition, an audiovisual emotion recognition method based on the decision level. This method separates the facial expression recognition and the speech emotion recognition in the video and adopts a two-process progressive emotion recognition method: through the calculation of conditional probability, the speech emotion recognition is carried out on the basis of the facial expression recognition, fully taking into account the influence of the facial expression recognition result on the speech emotion recognition, so that facial expression recognition and speech emotion recognition are fused more closely and assist each other, achieving a better human emotion recognition effect. It overcomes the defects of the prior art in human emotion recognition that the inner link between facial features and speech features is ignored, and that the recognition speed of video emotion recognition is slow and the recognition rate is not high.
(2) Different emotions have different sensitivities to different prosodic features. Chien et al. performed an "acoustic characteristic analysis" experiment in the 2014 paper "A new approach of audio emotion recognition". The experiment demonstrated that the sensitivities of 6 kinds of emotions to the prosodic features, i.e. Pitch, Zero-Crossing Rate, LogEnergy and Teager Energy Operator, are different. That paper classified the mel-frequency cepstrum coefficients (Mel-Frequency Cepstral Coefficients, MFCC) extracted from the audio with an SVM classifier; when two-class, four-class and six-class classification were carried out, the recognition rate decreased in turn — the fewer the classes, the better the classification effect of the classifier. Therefore, in the present invention we select three-class classification for the first facial expression classification and two-class classification for the second audio classification. The method of the present invention reduces a multi-class problem to three-class and two-class problems, which both reduces the feature dimension and shortens the training time, greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the present invention advantageously extracts not only the facial features in the video but also the audio features in the video; the bimodal combination of facial features and audio features is conducive to a more accurate identification of the emotion of the person in the video.
(4) Compared with CN105512609A, the method proposed in CN105512609A can only recognize three kinds of emotions in a video, while the present invention can identify six kinds of emotions in a video, and the average recognition rate of the present invention is 9.92% higher than the video emotion recognition rate in CN105512609A.
(5) Compared with CN105138991A, the method of the present invention advantageously classifies the facial features and the speech features separately, avoiding the "curse of dimensionality" easily caused by feature-level fusion; the fusion method at the decision level is simple in operation and faster in training and recognition speed.
(6) When extracting audio features, the present invention considers that different emotions have different sensitivities to different audio features, so that each grouping extracts different audio features, which is conducive to the second classification based on speech features.
(7) The present invention extracts texture, geometric, temporal and prosodic features; different features reflect different characteristics of an expression, so the classifier can be better trained and video emotion recognition is carried out from multiple modalities.
(8) The present invention uses a two-stage progressive emotion classification method, with face recognition as the main stage and speech recognition as the auxiliary stage; the two are complementary and assist each other, and more accurate video emotion recognition can be achieved.
Description of the drawings
Present invention will be further explained below with reference to the attached drawings and examples.
Fig. 1 is a schematic process flow diagram of the method of the present invention.
Fig. 2 is a labeling schematic diagram of the 6 specific distances and the 68 feature points of the face.
Fig. 3 is an example diagram of the 68-feature-point labeling of a face in the eNTERFACE'05 database.
Specific embodiment
The embodiment shown in Fig. 1 shows the flow of the method of the present invention: Process A. facial image expression recognition as the first classification recognition → frame extraction and voice signal extraction from the video signal → preprocessing of the image frame sequence and the voice signal → labeling facial feature points on the image frame sequence and screening the key frame in the image frame sequence → extraction of face texture features → extraction of face geometric features → grouping of facial expressions → first classification of facial expression recognition; Process B. speech emotion recognition as the second classification recognition → extraction of speech emotional features → second classification of speech emotion recognition; Process C. fusion of facial expression recognition and speech emotion recognition → fusion of facial expression recognition and speech emotion recognition at the decision level → thus the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
The embodiment shown in Fig. 2 shows the labeling of the 6 specific distances and the 68 facial feature points; it is an example image annotated with feature points. The 6 specific distances are, in order: the vertical distance between feature points 22 and 40, denoted D_{u,1}; the vertical distance between feature points 45 and 47, denoted D_{u,2}; the vertical distance between feature points 37 and 49, denoted D_{u,3}; the vertical distance between feature points 34 and 52, denoted D_{u,4}; the vertical distance between feature points 52 and 58, denoted D_{u,5}; and the horizontal distance between feature points 49 and 55, denoted D_{u,6}. The lines between the feature points in the figure outline the contours of the eyebrow, eye and mouth regions of the face.
The embodiment shown in Fig. 3 shows an example diagram of a face in the eNTERFACE'05 database labeled with facial feature points using Dlib; the 68 feature points marked in the figure correspond to the labels of the 68 feature points in the facial feature point labeling schematic shown in Fig. 2.
Embodiment 1
The video emotion recognition method fusing facial expression recognition and speech emotion recognition of this embodiment is a two-process progressive audiovisual emotion recognition method based on the decision level, with the specific steps as follows:
Process A. Facial image expression recognition as the first classification recognition:
Process A includes the extraction of facial expression features, the grouping of facial expressions and the first classification of facial expression recognition; the steps are as follows:
The first step, frame extraction and voice signal extraction from the video signal:
The video in the database is decomposed into an image frame sequence, frame extraction is performed using the open-source FormatFactory software, and the voice signal in the video is extracted and saved in MP3 format;
The second step, preprocessing of the image frame sequence and the voice signal:
The image frame sequence obtained in the above first step is subjected to face localization and cropping using the publicly available Viola&Jones algorithm, and the cropped facial images are normalized to M × M pixels, obtaining the facial-image-size-normalized image frame sequence;
The voice signal obtained in the above first step is subjected to speech detection using the well-known voice activity detection algorithm VAD, removing noise and silent segments and obtaining a voice signal from which features are easier to extract;
Thus the preprocessing of the image frame sequence and the voice signal is completed;
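The energy-based idea behind the VAD preprocessing step can be sketched as follows; this is a minimal stand-in rather than the specific well-known VAD algorithm the method relies on, and the frame length and threshold are arbitrary illustrative values:

```python
import math

def simple_energy_vad(signal, frame_len=160, threshold=0.01):
    """Minimal energy-threshold voice activity detection: keep only frames
    whose mean squared amplitude exceeds the threshold.  Real VAD algorithms
    (and the one assumed by the method) are considerably more elaborate."""
    kept = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(v * v for v in frame) / frame_len
        if energy > threshold:          # voiced frame: keep it
            kept.extend(frame)          # silent/noise frame: drop it
    return kept

# toy signal: one near-silent frame followed by one speech-like tone frame
sig = [0.001] * 160 + [0.5 * math.sin(0.2 * i) for i in range(160)]
voiced = simple_energy_vad(sig)
```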
Third step, labeling facial feature points on the image frame sequence and screening the key frame in the image frame sequence:
The facial-image-size-normalized image frame sequence of the above second step is subjected to labeling of T facial feature points, where the value range of T is 1, 2, ..., 68; the positions of the 68 feature points are well known, and the labeled feature point contours on the facial image are respectively the eye, eyebrow, nose and mouth regions. According to the coordinates of the T = 68 feature points in this embodiment, the following 6 specific distances are calculated for the u-th frame image in the facial-image-size-normalized image frame sequence of the above second step:
The vertical distance between the eyes and the eyebrows, D_{u,1}: D_{u,1} = d_vertical||p22, p40||,
The vertical distance of the eye opening, D_{u,2}: D_{u,2} = d_vertical||p45, p47||,
The vertical distance between the eyes and the mouth, D_{u,3}: D_{u,3} = d_vertical||p37, p49||,
The vertical distance between the nose and the mouth, D_{u,4}: D_{u,4} = d_vertical||p34, p52||,
The vertical distance between the upper and lower lips, D_{u,5}: D_{u,5} = d_vertical||p52, p58||,
The horizontal width distance between the two sides of the mouth, D_{u,6}: D_{u,6} = d_horizontal||p49, p55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} − p_{i,y}|, d_horizontal||p_i, p_j|| = |p_{j,x} − p_{i,x}| (1),
In formula (1), p_i is the coordinate set of the i-th feature point, p_j is the coordinate set of the j-th feature point, p_{i,y} is the ordinate of the i-th feature point, p_{j,y} is the ordinate of the j-th feature point, p_{i,x} is the abscissa of the i-th feature point, p_{j,x} is the abscissa of the j-th feature point, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, and i = 1, 2, ..., 68, j = 1, 2, ..., 68;
Taking the first frame in the facial-image-size-normalized image frame sequence of the above second step as the neutral frame, the set V_0 of its 6 specific distances is shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}] (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are respectively the 6 specific distances corresponding to the neutral frame in the facial-image-size-normalized image frame sequence of the above second step;
The set V_u of the 6 specific distances of the u-th frame in the facial-image-size-normalized image frame sequence of the above second step is shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}] (3),
In formula (3), u = 1, 2, ..., K−1, where K is the number of facial images in one group of the facial-image-size-normalized image frame sequence of the above second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are respectively the 6 specific distances corresponding to the u-th frame in that image frame sequence;
The sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame in the facial-image-size-normalized image frame sequence of the above second step is shown in formula (4),
In formula (4), DF_u represents the sum of the ratios of the 6 specific distances of the u-th frame image to those of the neutral frame image in the facial-image-size-normalized image frame sequence of the above second step, n is the index of the 6 specific distances, D_{0,n} represents the n-th specific distance corresponding to the neutral frame, and D_{u,n} represents the n-th specific distance corresponding to the u-th frame;
In the facial-image-size-normalized image frame sequence of the above second step, the distance-ratio DF corresponding to each frame image in the image frame sequence is obtained according to formulas (2), (3) and (4), and the image frame with the maximum DF obtained by screening is the key frame of the image frame sequence,
Thus the labeling of facial feature points on the image frame sequence and the screening of the key frame in the image frame sequence are completed;
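A minimal sketch of the key-frame screening just described, assuming formula (4) sums the per-distance ratios of the u-th frame to the neutral frame (the exact ratio direction is not explicit in the surviving text); the function name and toy numbers are ours:

```python
def key_frame_index(frames_distances):
    """Pick the key frame by the distance-ratio score of formulas (1)-(4).
    frames_distances: list of 6-element distance vectors [D_{u,1}..D_{u,6}],
    with frame 0 the neutral frame.  We read formula (4) as the sum of the
    ratios D_{u,n} / D_{0,n}, so the frame deviating most from neutral wins."""
    neutral = frames_distances[0]
    best_u, best_df = 1, float("-inf")
    for u in range(1, len(frames_distances)):
        df = sum(frames_distances[u][n] / neutral[n] for n in range(6))
        if df > best_df:
            best_u, best_df = u, df
    return best_u

# hypothetical distance vectors for a neutral frame and three expression frames
seq = [
    [10, 8, 30, 12, 6, 40],   # neutral frame (u = 0)
    [10, 8, 31, 12, 6, 40],
    [14, 12, 36, 13, 9, 46],  # most exaggerated expression -> key frame
    [11, 9, 32, 12, 7, 41],
]
k = key_frame_index(seq)  # -> 2
```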
4th step, extraction of face texture features:
The face texture features are extracted with the LBP-TOP algorithm. First, the facial-image-size-normalized image frame sequence of the above second step is divided in space-time into the three orthogonal planes XY, XT and YT; the LBP value of the central pixel of each 3 × 3 neighborhood is calculated in each orthogonal plane, the LBP histogram features of the three orthogonal planes are counted, and finally the LBP histograms of the three orthogonal planes are concatenated to form the overall feature vector, where the LBP operator is calculated as shown in formulas (5) and (6),
In formulas (5) and (6), Z is the number of neighborhood points of the central pixel, R is the distance between the neighborhood points and the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighborhood point, and Sig(t_q − t_c) is the LBP coded bit of the q-th neighborhood point,
The LBP-TOP histogram is defined as shown in formula (7),
In formula (7), b is the index of the plane, with b = 0 the XY plane, b = 1 the XT plane and b = 2 the YT plane; n_b is the number of binary patterns generated by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels whose LBP code value is a when feature extraction is performed with the LBP_{Z,R} operator in the b-th plane;
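The basic LBP operator of formulas (5) and (6) can be sketched for a single 3 × 3 neighborhood (Z = 8, R = 1) as follows; the neighbor ordering convention varies between implementations, and LBP-TOP simply applies such a code on the XY, XT and YT planes and histograms the results:

```python
def lbp_code(neighborhood):
    """LBP code of the center pixel of a 3x3 neighborhood (Z = 8, R = 1),
    cf. formulas (5) and (6): threshold the 8 neighbors t_q against the
    center t_c and weight the resulting bits by 2^q."""
    tc = neighborhood[1][1]
    # clockwise neighbor order starting at the top-left corner (a convention;
    # other implementations may start elsewhere or go counter-clockwise)
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for q, (r, c) in enumerate(coords):
        if neighborhood[r][c] >= tc:    # Sig(t_q - t_c)
            code += 1 << q
    return code

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)  # -> 241
```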
5th step, extraction of face geometric features:
According to the key frame of the screened image frame sequence obtained in the above third step, the coordinates of the T feature points marked in the key frame are calculated to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the facial T-shaped region, which mainly includes the eyebrow, eye, nose, chin and mouth regions, the specific feature points being shown in Table 1; therefore the extraction method of the face geometric features mainly extracts the distance features between the marked points in the facial T-shaped region;
5.1st step, calculating the Euclidean distance features of facial feature point pairs:
From the T feature points in the key frame of the screened image frame sequence obtained in the above third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 feature point pairs in total, and the Euclidean distance between the points A and B of each feature point pair is calculated, giving a 50-dimensional Euclidean distance feature, denoted G50.
The formula (8) for calculating the Euclidean distance between the points A and B of a feature point pair is as follows,
In formula (8), p_A is the coordinate set of feature point A, p_B is the coordinate set of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
Table 1 shows the facial feature point pairs in the facial T-shaped region to be calculated, where d||p_A, p_B|| denotes the Euclidean distance between the feature point pair A, B;
Table 1
5.2nd step, calculating the angle features of facial feature points:
From the T = 68 feature points of the key frame obtained by the screening of the above third step, 10 angles characterizing facial feature changes are selected, among them 2 eyebrow angles, 6 eye angles and 2 mouth angles; the angle features are extracted, giving a 10-dimensional angle feature in total, denoted Q10. The specific angles are shown in Table 2, and the formula (9) for calculating a feature point angle is as follows,
In formula (9), p_C, p_D and p_E are the coordinate sets of the three feature points corresponding to an angle formed in the eyebrow, eye and mouth regions of the facial feature points marked in the above third step, where p_D is the coordinate set of the vertex point;
Table 2 shows the angles of the facial feature points in the facial T-shaped region to be calculated, where Q(p_C, p_D, p_E) denotes the angle feature of angle D;
Table 2
5.3rd step, calculating the facial region area features:
5 regions of the facial image are selected, including the left and right eyebrows, the two eyes and the mouth, and the area features of these 5 regions are calculated separately; the specific areas are shown in Table 3;
Table 3
Table 3 shows the areas of the regions defined by the facial feature points in the facial T-shaped region to be calculated, where O(p_A, p_B, p_C, p_D) denotes the area of the region enclosed by the lines of feature points A, B, C, D;
Owing to the differences in face size between individuals, the areas of the 5 facial regions in Table 3 extracted from the key frame are here subtracted from the corresponding areas of the 5 facial regions extracted from the neutral frame, obtaining the change features of the facial image region areas, 5 dimensions in total, denoted O5; the facial eyebrow region, facial mouth region and facial eye region are set as triangles, and Heron's formula is used to calculate the area of each triangle. The Euclidean distance features G50 of the facial feature point pairs, the angle features Q10 of the facial feature points and the facial region area features O5 are combined into the geometric feature F of the face, as shown in formula (10),
F=[G50 Q10 O5] (10),
So far, the face texture features and the face geometric features are concatenated, and the extraction of facial expression features is completed;
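The three geometric ingredients used above — pairwise Euclidean distances (formula (8)), vertex angles (formula (9)) and Heron-formula triangle areas — can be sketched as follows; the feature-point coordinates are hypothetical and the function names are ours:

```python
import math

def euclid(pA, pB):
    """Euclidean distance between feature points A and B (cf. formula (8))."""
    return math.hypot(pB[0] - pA[0], pB[1] - pA[1])

def angle_at(pC, pD, pE):
    """Angle at vertex D formed by points C, D, E (cf. formula (9)), radians."""
    v1 = (pC[0] - pD[0], pC[1] - pD[1])
    v2 = (pE[0] - pD[0], pE[1] - pD[1])
    cos_q = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    return math.acos(max(-1.0, min(1.0, cos_q)))

def heron_area(pA, pB, pC):
    """Triangle area from the three side lengths via Heron's formula."""
    a, b, c = euclid(pB, pC), euclid(pA, pC), euclid(pA, pB)
    s = (a + b + c) / 2
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

d = euclid((0, 0), (3, 4))               # -> 5.0
q = angle_at((1, 0), (0, 0), (0, 1))     # -> pi/2
o = heron_area((0, 0), (4, 0), (0, 3))   # -> 6.0
F = [d, q, o]  # concatenated in the spirit of F = [G50 Q10 O5], formula (10)
```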
6th step, the grouping of facial expressions:
The six facial emotions: surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups, with the specific grouping as follows:
First group: surprise, fear; Second group: anger, disgust; Third group: happiness, sadness;
7th step, the first classification of facial expression recognition:
The facial expression features extracted in the above 4th and 5th steps are fed into an ELM classifier for training and testing, thereby completing the first classification of facial expression recognition and obtaining its recognition result, where the ELM parameters are set as: ELM type: "classification", number of hidden-layer neurons: "20", activation function: "Sigmoid";
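A minimal sketch of the ELM classification step, assuming the usual extreme-learning-machine recipe (random sigmoid hidden layer of 20 neurons, as stated, with output weights solved by least squares); the class name and toy data are ours:

```python
import numpy as np

class TinyELM:
    """Minimal ELM sketch: random sigmoid hidden layer, output weights by
    least squares (pseudo-inverse).  The 20 hidden neurons and sigmoid
    activation mirror the settings stated above."""
    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # sigmoid layer

    def fit(self, X, Y):
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # solve hidden @ beta ~= Y in the least-squares sense
        self.beta, *_ = np.linalg.lstsq(self._hidden(X), Y, rcond=None)
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)

# toy two-class data standing in for one expression group's feature vectors
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(5.0, 0.1, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
pred = TinyELM().fit(X, np.eye(2)[labels]).predict(X)
acc = (pred == labels).mean()
```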
Process B. Speech emotion recognition as the second classification recognition:
On the basis of the facial expression recognition result of process A, process B combines speech features to carry out speech emotional feature extraction and the second classification of speech emotion recognition for each of the three groups in the above 6th-step facial expression grouping; the concrete operations are as follows:
8th step, extraction of speech emotional features:
For the classification results of the first classification of facial expression recognition in the above 7th step, and according to the grouping of the 6th step, different prosodic features are extracted for each group of emotions according to their different sensitivities to different audio prosodic features:
First group: zero-crossing rate ZCR and logarithmic energy LogE is extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR, logarithmic energy LogE are extracted,
Third group: extracting pitch Pitch, zero-crossing rate ZCR, Teager energy operator TEO,
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain,
For the voice signal M preprocessed in the above second step, the pitch Pitch is calculated with the following formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, L_M represents the length of the voice signal, and the windowed term represents the voice signal with a Hamming window applied, whose calculation is shown in the following formula (12),
In formula (12), N is the number of Hamming windows, and m is the m-th Hamming window;
The calculation of the zero-crossing rate ZCR among the above prosodic features is shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, X(m) is the voice signal of the m-th window after framing and windowing, and the sgn{X(m)} function judges whether the speech amplitude is positive or negative; sgn{X(m)} is calculated by formula (14),
In formula (14), X(m) is the voice signal of the m-th window after framing and windowing;
The calculation formula (15) of the above logarithmic energy LogE is as follows,
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the voice signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency, and θ is the initial phase angle of the signal,
For each group of the three groups in the above 6th-step facial expression grouping, the well-known mel-frequency cepstrum coefficients MFCC, together with their first-order difference features and second-order difference features, are extracted from the audio file corresponding to the group's image frames; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of speech emotional features is completed;
9th step, the second classification of speech emotion recognition:
The speech emotional features extracted in the above 8th step are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
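A sketch of the second-stage SVM configuration under the parameter values just stated; mapping "penalty coefficient" to `C` and "kernel parameter" to `gamma` in scikit-learn is our assumption, and the two-class data is toy:

```python
import numpy as np
from sklearn.svm import SVC

# Gaussian-kernel SVM; C=95 and gamma=1 correspond to the stated penalty
# coefficient and kernel parameter under our scikit-learn name mapping
clf = SVC(C=95, kernel="rbf", gamma=1)

rng = np.random.default_rng(0)
# toy audio-feature vectors for the two emotions within one expression group
X = np.vstack([rng.normal(0.0, 0.2, (15, 3)), rng.normal(2.0, 0.2, (15, 3))])
y = np.array([0] * 15 + [1] * 15)
clf.fit(X, y)
acc = clf.score(X, y)   # training accuracy on the toy data
```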
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since the speech emotion recognition is a secondary recognition carried out on the basis of the facial emotion recognition, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual)=P(Visual)×P(Audio|Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first facial image recognition, and P(Audio|Visual) is the recognition rate of the second speech emotion recognition;
So far, the video emotion recognition fusing the two-process progressive facial expression recognition and speech emotion recognition at the decision level is completed.
This embodiment is compared experimentally with the existing related technology on the eNTERFACE'05 and RML databases; the specific recognition rates are given in Table 4 below:
Table 4
The experimental results of Table 4 list the recognition rate comparison of audiovisual emotion recognition systems on the eNTERFACE'05 and RML databases in recent years: Mahdi Bejani et al. in 2014, in the document "Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks", achieved an average recognition rate of 77.78% for audiovisual emotion recognition on the eNTERFACE'05 database;
Shiqing Zhang et al. in 2016, in the document "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition", achieved an average recognition rate of 74.32% for audiovisual emotion recognition on the RML database;
Shiqing Zhang et al. in 2017, in the document "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition", achieved average recognition rates of 85.97% and 80.36% respectively for audiovisual emotion recognition on the eNTERFACE'05 and RML databases;
Yaxiong Ma et al. in 2018, in the document "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach", achieved average recognition rates of 84.56% and 81.98% respectively for audiovisual emotion recognition on the eNTERFACE'05 and RML databases.
The two-process progressive audiovisual emotion recognition method based on the decision level used in this embodiment shows a distinct improvement in recognition rate compared with these papers of recent years.
In this embodiment, the English name of the voice activity detection algorithm is Voice Activity Detection, abbreviated VAD; the English name of the logarithmic energy is LogEnergy, abbreviated LogE; the English name of the zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the English name of the Teager energy operator is Teager Energy Operator, abbreviated TEO; the English name of the mel-frequency cepstrum coefficient is Mel-frequency cepstral coefficients, abbreviated MFCC. The voice endpoint detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and mel-frequency cepstrum coefficient used here are all well known to the art.
In this embodiment, the related calculating operation methods are ones that those skilled in the art will appreciate.
Claims (2)
1. A video emotion recognition method fusing facial expression recognition and speech emotion recognition, characterized in that it is an audiovisual emotion recognition method based on two progressive processes at the decision level, with the following specific steps:
Process A. Facial image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions, and the first classification of facial expression recognition, with the following steps:
Step 1: frame extraction from the video signal and extraction of the speech signal:
Each video in the database is decomposed into an image frame sequence, with the frame extraction performed by the open-source FormatFactory software, and the speech signal in the video is extracted and saved in MP3 format;
Step 2: preprocessing of the image frame sequence and the speech signal:
The face in the image frame sequence obtained in Step 1 is located and cropped with the published Viola & Jones algorithm, and the cropped face image is normalized to M × M pixels, yielding a size-normalized face image frame sequence;
The speech signal obtained in Step 1 is processed with the well-known voice activity detection algorithm VAD to detect speech and remove noise and silent segments, yielding a speech signal from which features are easier to extract;
This completes the preprocessing of the image frame sequence and the speech signal;
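The claim only requires a generic, well-known VAD; as a rough illustration, a minimal energy-threshold detector over fixed-length frames might look like the following sketch (the frame length, threshold ratio and toy signal are illustrative assumptions, not part of the claim):

```python
import numpy as np

def simple_vad(signal, frame_len=256, threshold_ratio=0.1):
    """Keep only frames whose short-time energy exceeds a fraction of the peak frame energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames ** 2).sum(axis=1)
    keep = energies > threshold_ratio * energies.max()
    return frames[keep].ravel()

# toy signal: silence, then a voiced burst, then silence
sig = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 50, 512)), np.zeros(512)])
voiced = simple_vad(sig)
```

Real VAD algorithms combine several cues (energy, zero-crossing rate, spectral features); the energy threshold here only illustrates the effect of removing silent segments.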
Step 3: marking facial feature points in the image frame sequence and screening the key frame of the sequence:
T facial feature points are marked on the size-normalized face image frame sequence from Step 2, where T takes values 1, 2, …, 68; the positions of the 68 feature points are well known, and the marked feature points outline the eye, eyebrow, nose and mouth regions on the face image. From the coordinates of the T feature points, the following 6 specific distances are calculated for the u-th frame of the size-normalized image frame sequence from Step 2:
vertical distance between eyes and eyebrows Du,1: Du,1 = dvertical||p22, p40||,
vertical distance of eye opening Du,2: Du,2 = dvertical||p45, p47||,
vertical distance between eyes and mouth Du,3: Du,3 = dvertical||p37, p49||,
vertical distance between nose and mouth Du,4: Du,4 = dvertical||p34, p52||,
vertical distance between upper and lower lips Du,5: Du,5 = dvertical||p52, p58||,
horizontal width of the mouth Du,6: Du,6 = dhorizontal||p49, p55||,
where
dvertical||pi, pj|| = |pj,y − pi,y|, dhorizontal||pi, pj|| = |pj,x − pi,x| (1),
In formula (1), pi is the coordinate of the i-th feature point, pj is the coordinate of the j-th feature point, pi,y and pj,y are their ordinates, pi,x and pj,x are their abscissas, dvertical||pi, pj|| is the vertical distance between feature points i and j, dhorizontal||pi, pj|| is the horizontal distance between feature points i and j, i = 1, 2, …, 68, j = 1, 2, …, 68;
Taking the first frame of the size-normalized face image frame sequence from Step 2 as the neutral frame, the set V0 of its 6 specific distances is as shown in formula (2),
V0 = [D0,1, D0,2, D0,3, D0,4, D0,5, D0,6] (2),
In formula (2), D0,1, D0,2, D0,3, D0,4, D0,5 and D0,6 are the 6 specific distances of the neutral frame in the size-normalized image frame sequence from Step 2;
The set Vu of the 6 specific distances of the u-th frame in the size-normalized image frame sequence from Step 2 is as shown in formula (3),
Vu = [Du,1, Du,2, Du,3, Du,4, Du,5, Du,6] (3),
In formula (3), u = 1, 2, …, K−1, where K is the number of face images in the size-normalized image frame sequence from Step 2, and Du,1, Du,2, Du,3, Du,4, Du,5, Du,6 are the 6 specific distances of the u-th frame;
The sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame is as shown in formula (4),
DFu = Σ (n = 1 to 6) Du,n / D0,n (4),
In formula (4), DFu represents the sum of the ratios of the 6 corresponding specific distances between the neutral frame and the u-th frame, n indexes the 6 specific distances, D0,n is the n-th specific distance of the neutral frame, and Du,n is the n-th specific distance of the u-th frame;
For the size-normalized image frame sequence from Step 2, the specific-distance ratio DFu of every frame is computed according to formulas (2), (3) and (4), and the frame with the maximum DF is screened out as the key frame of the image frame sequence;
This completes the marking of facial feature points in the image frame sequence and the screening of the key frame;
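The key-frame screening above can be sketched as follows, assuming the 68 landmarks arrive as a (68, 2) array of (x, y) coordinates per frame and reading formula (4) as the sum of the per-distance ratios Du,n/D0,n (the printed formula is not reproduced in this text, so that reading is an assumption):

```python
import numpy as np

# 1-based landmark index pairs of the six specific distances (from the claim text)
PAIRS = [(22, 40), (45, 47), (37, 49), (34, 52), (52, 58), (49, 55)]

def six_distances(pts):
    """pts: (68, 2) array of landmark coordinates. Returns the 6 specific distances:
    the first five are vertical (|dy|), the sixth horizontal (|dx|), per formula (1)."""
    d = []
    for k, (i, j) in enumerate(PAIRS):
        axis = 0 if k == 5 else 1            # last pair: horizontal mouth width
        d.append(abs(pts[j - 1, axis] - pts[i - 1, axis]))
    return np.array(d, dtype=float)

def key_frame_index(frames_pts):
    """frames_pts: list of (68, 2) landmark arrays; frame 0 is the neutral frame.
    Returns the index (>= 1) of the frame maximising DF_u = sum_n D_{u,n} / D_{0,n}."""
    d0 = six_distances(frames_pts[0])
    dfs = [(six_distances(p) / d0).sum() for p in frames_pts[1:]]
    return 1 + int(np.argmax(dfs))
```

The frame with the largest DF corresponds to the strongest geometric deviation from the neutral face, i.e. the expression apex.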
Step 4: extraction of face texture features:
Face texture features are extracted with the LBP-TOP algorithm. First, the size-normalized face image frame sequence from Step 2 is divided in space-time into the three orthogonal planes XY, XT and YT; in each orthogonal plane, the LBP value of the central pixel of each 3 × 3 neighborhood is calculated, and the LBP histogram feature of each of the three orthogonal planes is counted; finally the LBP histograms of the three orthogonal planes are concatenated into an overall feature vector. The LBP operator is calculated as shown in formulas (5) and (6),
LBPZ,R = Σ (q = 0 to Z−1) Sig(tq − tc) · 2^q (5),
Sig(tq − tc) = 1 if tq − tc ≥ 0, and 0 otherwise (6),
In formulas (5) and (6), Z is the number of points in the neighborhood of the central pixel, R is the distance between the neighborhood points and the central pixel, tc is the pixel value of the central pixel, tq is the pixel value of the q-th neighborhood point, and Sig(tq − tc) is the LBP code bit of the q-th neighborhood point;
The LBP-TOP histogram is defined as shown in formula (7),
Ha,b = Σ (x, y, t) I{ LBPZ,R,b(x, y, t) = a }, a = 0, …, nb − 1 (7),
In formula (7), b is the plane index, with b = 0 the XY plane, b = 1 the XT plane and b = 2 the YT plane; nb is the number of binary patterns produced by the LBP operator in the b-th plane; and I{LBPZ,R,b(x, y, t) = a} counts the pixels of the b-th plane whose LBP code value equals a when features are extracted with the LBPZ,R operator;
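A minimal sketch of the basic LBP code of formulas (5)–(6) on a single plane, using the 8 neighbours of a 3 × 3 neighbourhood (a simplification of the general (Z, R) circular sampling):

```python
import numpy as np

# 8 neighbour offsets of a 3x3 neighbourhood, in a fixed circular order
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_code(img, y, x):
    """LBP code of the centre pixel (y, x): neighbour q contributes bit 2^q
    when its value is >= the centre value (the Sig function of formula (6))."""
    tc = img[y, x]
    code = 0
    for q, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy, x + dx] >= tc:
            code |= 1 << q
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels of one plane (formula (7))."""
    h = np.zeros(256, dtype=int)
    for y in range(1, img.shape[0] - 1):
        for x in range(1, img.shape[1] - 1):
            h[lbp_code(img, y, x)] += 1
    return h
```

In LBP-TOP this histogram is computed separately for the XY, XT and YT planes of the frame volume and the three histograms are concatenated into the texture feature vector.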
Step 5: extraction of face geometric features:
From the key frame screened in Step 3, the coordinates of the T feature points marked in the key frame are calculated to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the T-shaped region of the face, comprising mainly the eyebrows, eyes, nose, chin and mouth; the extraction of face geometric features therefore mainly extracts distance features between the marked points of the T-shaped region;
Step 5.1: calculating the Euclidean distance features of facial feature point pairs:
From the T feature points of the key frame screened in Step 3, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points are chosen, 50 pairs in total, and the Euclidean distance between each feature point pair A and B is calculated, giving a 50-dimensional Euclidean distance feature denoted G50. The Euclidean distance between feature points A and B is calculated with formula (8),
d(A, B) = √((pA,x − pB,x)² + (pA,y − pB,y)²) (8),
In formula (8), pA is the coordinate of feature point A, pB is the coordinate of feature point B, pA,x is the abscissa of A, pA,y is the ordinate of A, pB,x is the abscissa of B, and pB,y is the ordinate of B;
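Formula (8) is the ordinary planar Euclidean distance; as a sketch:

```python
import math

def euclidean(pA, pB):
    """Euclidean distance between two landmarks given as (x, y) tuples (formula (8))."""
    return math.hypot(pA[0] - pB[0], pA[1] - pB[1])
```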
5.2nd step calculates the angle character of human face characteristic point:
10 angles that selection characterization face characteristic changes in T characteristic point of the key frame obtained from the screening of above-mentioned third step,
Wherein 22 angles of eyebrow, 6 angles of eyes and mouth angles are calculated, and extract angle character, totally 10 dimension angle character,
It is denoted as Q10, the formula (9) for calculating characteristic point angle is as follows,
In formula (9), pC、pD、pEIt is that eyebrow, eyes and mouth region that above-mentioned third step marks human face characteristic point are formed
Three characteristic point coordinate sets corresponding to angle, wherein pDFor corner point coordinate set;
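The printed formula (9) is not reproduced in this text; it is reconstructed here, as an assumption, as the angle at the corner point pD between the rays toward pC and pE via the dot-product rule:

```python
import math

def corner_angle(pC, pD, pE):
    """Angle in radians at pD formed by the triple pC-pD-pE."""
    v1 = (pC[0] - pD[0], pC[1] - pD[1])
    v2 = (pE[0] - pD[0], pE[1] - pD[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    # clamp guards against rounding slightly outside [-1, 1]
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
```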
5.3rd step calculates human face region area features:
5 regions of facial image, including left and right eyebrow, two eyes and mouth are selected, this 5 regions are calculated separately
Area features, due to the otherness of everyone human face size, here by the face of extracted 5 human face regions of key frame
Product is corresponding with the area of extracted 5 human face regions of neutral frame to subtract each other, and obtains the variation characteristic of facial image region area, altogether
5 dimensions are denoted as O5, face brow region, face mouth region and face eye areas are set as triangle, utilize Helen's public affairs
Formula calculates each triangle area, by the Euclidean distance feature G of human face characteristic point pair50, the angle character Q of human face characteristic point10With
Human face region area features O5It combines shown in the geometrical characteristic F such as formula (10) as face,
F=[G50 Q10 O5] (10),
So far, series connection face textural characteristics and Face geometric eigenvector complete the extraction of human face expression feature;
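A sketch of the triangle-area computation with Heron's formula and of the resulting 5-dimensional area-change feature O5 (the specific triangle vertices below are hypothetical; in practice they come from the marked landmarks):

```python
import math

def heron_area(p1, p2, p3):
    """Area of a triangle from its three (x, y) vertices via Heron's formula."""
    a = math.dist(p1, p2)
    b = math.dist(p2, p3)
    c = math.dist(p3, p1)
    s = (a + b + c) / 2                       # semi-perimeter
    return math.sqrt(max(0.0, s * (s - a) * (s - b) * (s - c)))

def area_change_features(key_triangles, neutral_triangles):
    """O5: per-region area of the key frame minus that of the neutral frame
    (5 regions: left/right eyebrow, two eyes, mouth)."""
    return [heron_area(*k) - heron_area(*n)
            for k, n in zip(key_triangles, neutral_triangles)]
```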
Step 6: grouping of facial expressions:
The six facial emotions, surprise, fear, anger, disgust, happiness and sadness, are divided into three groups of two, as follows:
Group 1: surprise, fear; Group 2: anger, disgust; Group 3: happiness, sadness;
Step 7: first classification of facial expression recognition:
The facial expression features extracted in Steps 4 and 5 are fed into an ELM classifier for training and testing, completing the first classification of facial expression recognition and obtaining its recognition result. The ELM parameters are: ELM type: "classification"; number of hidden-layer neurons: "20"; activation function: "Sigmoid";
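An extreme learning machine of the kind named in Step 7 can be sketched with numpy: a random, fixed hidden layer and a least-squares output layer. The 20-neuron width and sigmoid activation follow the parameters in the text; everything else (random seed, toy data) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine: random fixed hidden weights,
    output weights solved in closed form by least squares."""
    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # sigmoid activation on a random (never-trained) projection
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        T = np.eye(n_classes)[y]                            # one-hot targets
        self.beta = np.linalg.pinv(self._hidden(X)) @ T     # least-squares output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)
```

Because only the output weights are solved (no gradient descent), training is a single pseudo-inverse, which is the usual motivation for choosing an ELM here.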
Process B. Speech emotion recognition as the second classification:
On the basis of the facial expression recognition result of Process A, and in combination with speech features, Process B performs speech emotional feature extraction and the second classification of speech emotion recognition for each of the three groups in the facial expression grouping of Step 6, as follows:
Step 8: extraction of speech emotional features:
For the classification result of the first classification of facial expression recognition in Step 7, and according to the grouping of Step 6, different prosodic features are extracted for each group, reflecting the different sensitivity of each group's emotions to different audio prosodic features:
Group 1: zero-crossing rate ZCR and logarithmic energy LogE are extracted;
Group 2: Teager energy operator TEO, zero-crossing rate ZCR and logarithmic energy LogE are extracted;
Group 3: pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO are extracted;
Among the above prosodic features, the pitch Pitch is calculated in the frequency domain: for the speech signal M preprocessed in Step 2, the pitch Pitch is calculated with formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform function, LM represents the length of the speech signal, and the Hamming-windowed speech signal is calculated as shown in formula (12);
In formula (12), N is the number of Hamming windows and m is the m-th Hamming window;
The zero-crossing rate ZCR among the above prosodic features is calculated as shown in formula (13),
ZCR = (1 / 2N) Σ (m = 1 to N) |sgn{X(m)} − sgn{X(m − 1)}| (13),
In formula (13), ZCR indicates the average zero-crossing rate over the N windows, | | is the absolute value sign, X(m) is the speech signal of the m-th window after framing and windowing, and sgn{X(m)} judges the sign of the speech amplitude; sgn{X(m)} is calculated by formula (14),
sgn{X(m)} = 1 if X(m) ≥ 0, and −1 otherwise (14),
In formula (14), X(m) is the speech signal of the m-th window after framing and windowing;
The logarithmic energy LogE is calculated with formula (15),
LogE = log( Σ (m = 1 to N) X(m)² ) (15),
In formula (15), LogE indicates the total logarithmic energy of the N windows, X(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
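A sketch of short-time zero-crossing rate and log-energy features, computed here per analysis window over a framed signal; since the printed formulas (13)–(15) are not reproduced in this text, this follows the standard textbook definitions and is an assumption rather than the authors' exact formulation:

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split a 1-D signal into non-overlapping frames of frame_len samples."""
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

def zcr(frames):
    """Zero-crossing rate per frame: fraction of adjacent-sample sign changes."""
    signs = np.sign(frames)
    signs[signs == 0] = 1          # treat 0 as positive, as in the sgn of formula (14)
    return np.abs(np.diff(signs, axis=1)).sum(axis=1) / (2 * (frames.shape[1] - 1))

def log_energy(frames, eps=1e-12):
    """Log of the summed squared amplitude per frame (eps avoids log(0))."""
    return np.log((frames ** 2).sum(axis=1) + eps)
```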
The Teager energy operator TEO is defined as shown in formula (16),
ψ[X(m)] = [X′(m)]² − X(m) · X″(m) (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X′(m) = dX(m)/dm and X″(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency and θ is the initial phase angle of the signal;
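In discrete time the Teager energy operator of formula (16) is commonly approximated as ψ[x(m)] = x(m)² − x(m−1)·x(m+1); a sketch:

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[x(m)] = x(m)^2 - x(m-1)*x(m+1)."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```

For the constant-amplitude cosine in the text, x(m) = a·cos(φm + θ), this yields exactly a²·sin²(φ), which approximates a²φ² for small φ, i.e. an energy proportional to both amplitude and frequency squared.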
For the audio file corresponding to the image frames of each of the three groups in the facial expression grouping of Step 6, the well-known Mel-frequency cepstral coefficients MFCC and their first-order and second-order difference features are extracted; finally, the prosodic features extracted for each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form mixed audio features;
This completes the extraction of speech emotional features;
Step 9: second classification of speech emotion recognition:
The speech emotional features extracted in Step 8 are fed into an SVM for training and testing, finally obtaining the recognition rate of speech emotion recognition. The SVM parameters are: penalty coefficient: "95"; allowed redundancy output: "0"; kernel parameter: "1"; kernel function of the support vector machine: "Gaussian kernel";
This completes the second classification of speech emotion recognition;
Process C. Fusion of facial expression recognition and speech emotion recognition:
Step 10: fusion of facial expression recognition and speech emotion recognition at the decision level:
Because the speech emotion recognition of Process B is a second recognition carried out on the basis of the facial emotion recognition of Process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual) = P(Visual) × P(Audio | Visual) (17),
In formula (17), P(Visual) is the recognition rate of the first classification on facial images, and P(Audio | Visual) is the recognition rate of the second classification on speech emotion;
This completes the video emotion recognition fusing facial expression recognition and speech emotion recognition based on two progressive processes at the decision level.
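Formula (17) is simply the chain rule for the two-stage pipeline: both stages must be correct, so the final rate is the first-stage rate times the conditional second-stage rate. With the hypothetical rates P(Visual) = 0.95 and P(Audio | Visual) = 0.90 (illustrative numbers, not taken from the patent):

```python
def fused_rate(p_visual, p_audio_given_visual):
    """Decision-level fusion of formula (17): P(Audio_Visual) = P(Visual) * P(Audio|Visual)."""
    return p_visual * p_audio_given_visual

print(fused_rate(0.95, 0.90))   # hypothetical stage rates
```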
2. The video emotion recognition method fusing facial expression recognition and speech emotion recognition according to claim 1, characterized in that: for the coordinates of the T feature points in Step 3, T = 68.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811272233.1A CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409296A true CN109409296A (en) | 2019-03-01 |
CN109409296B CN109409296B (en) | 2020-12-01 |
Family
ID=65470610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811272233.1A Expired - Fee Related CN109409296B (en) | 2018-10-30 | 2018-10-30 | Video emotion recognition method integrating facial expression recognition and voice emotion recognition |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409296B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1731416A (en) * | 2005-08-04 | 2006-02-08 | 上海交通大学 | Method of quick and accurate human face feature point positioning |
CN105139004A (en) * | 2015-09-23 | 2015-12-09 | 河北工业大学 | Face expression identification method based on video sequences |
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
CN107704810A (en) * | 2017-09-14 | 2018-02-16 | 南京理工大学 | A kind of expression recognition method suitable for medical treatment and nursing |
CN108682431A (en) * | 2018-05-09 | 2018-10-19 | 武汉理工大学 | A kind of speech-emotion recognition method in PAD three-dimensionals emotional space |
Non-Patent Citations (4)
Title |
---|
CHIEN SHING OOI ET AL: "A new approach of audio emotion recognition", 《ELSEVIER》 * |
TARUN KRISHNA ET AL: "Emotion recognition using facial and audio features", 《ICMI '13: PROCEEDINGS OF THE 15TH ACM ON INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION》 * |
LU GUANMING ET AL: "Micro-expression recognition based on LBP-TOP features", 《JOURNAL OF NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS (NATURAL SCIENCE EDITION)》 * |
HAN ZHIYAN: "Research on multimodal emotion recognition technology for speech and facial expression signals", 31 January 2017 * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109961054A (en) * | 2019-03-29 | 2019-07-02 | 山东大学 | It is a kind of based on area-of-interest characteristic point movement anxiety, depression, angry facial expression recognition methods |
CN110363074A (en) * | 2019-06-03 | 2019-10-22 | 华南理工大学 | One kind identifying exchange method for complicated abstract class of things peopleization |
CN110414335A (en) * | 2019-06-20 | 2019-11-05 | 北京奇艺世纪科技有限公司 | Video frequency identifying method, device and computer readable storage medium |
CN110443143A (en) * | 2019-07-09 | 2019-11-12 | 武汉科技大学 | The remote sensing images scene classification method of multiple-limb convolutional neural networks fusion |
CN112308102A (en) * | 2019-08-01 | 2021-02-02 | 北京易真学思教育科技有限公司 | Image similarity calculation method, calculation device, and storage medium |
CN110677598A (en) * | 2019-09-18 | 2020-01-10 | 北京市商汤科技开发有限公司 | Video generation method and device, electronic equipment and computer storage medium |
CN110909613A (en) * | 2019-10-28 | 2020-03-24 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN110909613B (en) * | 2019-10-28 | 2024-05-31 | Oppo广东移动通信有限公司 | Video character recognition method and device, storage medium and electronic equipment |
CN111144197A (en) * | 2019-11-08 | 2020-05-12 | 宇龙计算机通信科技(深圳)有限公司 | Human identification method, device, storage medium and electronic equipment |
CN111178389B (en) * | 2019-12-06 | 2022-02-11 | 杭州电子科技大学 | Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling |
CN111178389A (en) * | 2019-12-06 | 2020-05-19 | 杭州电子科技大学 | Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling |
CN111553311A (en) * | 2020-05-13 | 2020-08-18 | 吉林工程技术师范学院 | Micro-expression recognition robot and control method thereof |
CN111798874A (en) * | 2020-06-24 | 2020-10-20 | 西北师范大学 | Voice emotion recognition method and system |
CN112101462A (en) * | 2020-09-16 | 2020-12-18 | 北京邮电大学 | Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN |
CN112101462B (en) * | 2020-09-16 | 2022-04-19 | 北京邮电大学 | Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN |
CN112418095A (en) * | 2020-11-24 | 2021-02-26 | 华中师范大学 | Facial expression recognition method and system combined with attention mechanism |
CN112418095B (en) * | 2020-11-24 | 2023-06-30 | 华中师范大学 | Facial expression recognition method and system combined with attention mechanism |
CN112488219A (en) * | 2020-12-07 | 2021-03-12 | 江苏科技大学 | Mood consolation method and system based on GRU and mobile terminal |
CN112766112A (en) * | 2021-01-08 | 2021-05-07 | 山东大学 | Dynamic expression recognition method and system based on space-time multi-feature fusion |
CN112766112B (en) * | 2021-01-08 | 2023-01-17 | 山东大学 | Dynamic expression recognition method and system based on space-time multi-feature fusion |
CN114005153A (en) * | 2021-02-01 | 2022-02-01 | 南京云思创智信息科技有限公司 | Real-time personalized micro-expression recognition method for face diversity |
CN112949560B (en) * | 2021-03-24 | 2022-05-24 | 四川大学华西医院 | Method for identifying continuous expression change of long video expression interval under two-channel feature fusion |
CN112949560A (en) * | 2021-03-24 | 2021-06-11 | 四川大学华西医院 | Method for identifying continuous expression change of long video expression interval under two-channel feature fusion |
CN113076847A (en) * | 2021-03-29 | 2021-07-06 | 济南大学 | Multi-mode emotion recognition method and system |
CN113065449A (en) * | 2021-03-29 | 2021-07-02 | 济南大学 | Face image acquisition method and device, computer equipment and storage medium |
CN113111789A (en) * | 2021-04-15 | 2021-07-13 | 山东大学 | Facial expression recognition method and system based on video stream |
CN113128399A (en) * | 2021-04-19 | 2021-07-16 | 重庆大学 | Speech image key frame extraction method for emotion recognition |
CN117577140A (en) * | 2024-01-16 | 2024-02-20 | 北京岷德生物科技有限公司 | Speech and facial expression data processing method and system for cerebral palsy children |
CN117577140B (en) * | 2024-01-16 | 2024-03-19 | 北京岷德生物科技有限公司 | Speech and facial expression data processing method and system for cerebral palsy children |
Also Published As
Publication number | Publication date |
---|---|
CN109409296B (en) | 2020-12-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109409296A (en) | The video feeling recognition methods that facial expression recognition and speech emotion recognition are merged | |
CN108805087B (en) | Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system | |
CN108877801B (en) | Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system | |
CN108805089B (en) | Multi-modal-based emotion recognition method | |
CN108805088B (en) | Physiological signal analysis subsystem based on multi-modal emotion recognition system | |
CN108717856B (en) | Speech emotion recognition method based on multi-scale deep convolution cyclic neural network | |
Datcu et al. | Semantic audiovisual data fusion for automatic emotion recognition | |
Chen et al. | K-means clustering-based kernel canonical correlation analysis for multimodal emotion recognition in human–robot interaction | |
CN109614895A (en) | A method of the multi-modal emotion recognition based on attention Fusion Features | |
Tawari et al. | Face expression recognition by cross modal data association | |
Yang et al. | Feature augmenting networks for improving depression severity estimation from speech signals | |
CN113158727A (en) | Bimodal fusion emotion recognition method based on video and voice information | |
Ocquaye et al. | Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition | |
CN106297825A (en) | A kind of speech-emotion recognition method based on integrated degree of depth belief network | |
Alshamsi et al. | Automated facial expression and speech emotion recognition app development on smart phones using cloud computing | |
Liang | Intelligent emotion evaluation method of classroom teaching based on expression recognition | |
Byun et al. | Human emotion recognition based on the weighted integration method using image sequences and acoustic features | |
Jaratrotkamjorn et al. | Bimodal emotion recognition using deep belief network | |
Huijuan et al. | Coarse-to-fine speech emotion recognition based on multi-task learning | |
Atkar et al. | Speech emotion recognition using dialogue emotion decoder and CNN Classifier | |
Veni et al. | Feature fusion in multimodal emotion recognition system for enhancement of human-machine interaction | |
Chelali | Bimodal fusion of visual and speech data for audiovisual speaker recognition in noisy environment | |
Li et al. | A novel art gesture recognition model based on two channel region-based convolution neural network for explainable human-computer interaction understanding | |
Datcu et al. | Multimodal recognition of emotions in car environments | |
Fu et al. | An adversarial training based speech emotion classifier with isolated gaussian regularization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20201201 |