CN109409296A - Video emotion recognition method fusing facial expression recognition and speech emotion recognition - Google Patents

Video emotion recognition method fusing facial expression recognition and speech emotion recognition

Info

Publication number
CN109409296A
Authority
CN
China
Prior art keywords
formula
recognition
feature
point
mentioned
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811272233.1A
Other languages
Chinese (zh)
Other versions
CN109409296B (en)
Inventor
于明
张冰
郭迎春
于洋
师硕
郝小可
朱叶
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology
Priority to CN201811272233.1A
Publication of CN109409296A
Application granted
Publication of CN109409296B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 - Facial expression recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The present invention is a video emotion recognition method that fuses facial expression recognition and speech emotion recognition. It relates to the processing of record carriers used for the recognition of figures and is an audio-visual emotion recognition method based on two progressive decision-level stages. The method separates the facial expression recognition and the speech emotion recognition in a video and performs emotion recognition in two progressive stages: by computing a conditional probability, speech emotion recognition is carried out on the basis of the facial expression recognition result. The steps are: process A, facial image expression recognition as the first classification; process B, speech emotion recognition as the second classification; process C, fusion of facial expression recognition and speech emotion recognition. The present invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and has a low recognition rate.

Description

Video emotion recognition method fusing facial expression recognition and speech emotion recognition
Technical field
The technical solution of the present invention relates to the processing of record carriers used for the recognition of figures, and in particular to a video emotion recognition method that fuses facial expression recognition and speech emotion recognition.
Background art
With the rapid development of artificial intelligence and computer vision, human-computer interaction technology is advancing quickly. Human emotion recognition by computer has attracted wide attention, and how to make computers recognize human emotions faster and more accurately has become a current research hotspot in the field of machine vision.
Human emotional expression is diverse and mainly comprises facial expression, speech emotion, upper-body posture and language text. Among these, facial expression and speech emotion are the two most typical modes of emotional expression. Because the texture and geometric features of the face are relatively easy to extract, emotion recognition methods based on facial expression have already reached a relatively high recognition rate in the current emotion recognition field. However, for expressions that resemble each other, such as anger and disgust or fear and surprise, the texture and geometric features are similar, and a recognition method that only extracts facial expression features achieves a low recognition rate.
Emotion recognition methods based on a single modality often have certain limitations, so bimodal and multi-modal emotion recognition methods have increasingly become a focus of research and attention in the emotion recognition field. The key to multi-modal emotion recognition is the fusion of the modalities; the mainstream fusion schemes are feature-level fusion and decision-level fusion.
In 2012, Schuller et al., in the paper "AVEC: the continuous audio/visual emotion challenge", cascaded audio and video features into a single feature vector and used support vector regression SVR as the baseline of the AVEC 2012 challenge; this feature-level fusion method directly concatenates the multi-modal features into a joint feature vector. Because the number of multi-modal features is huge, the curse of dimensionality is easily incurred and high-dimensional features are very susceptible to the sparsity problem; taking the interaction between features into account, the advantage of combining audio features and video features in feature-level fusion is therefore limited.
Decision-level fusion means that each mode of emotional expression is first modelled by its own classifier and the recognition results of the classifiers are then fused; without increasing the dimensionality, the different modes are combined according to the contribution of each mode of emotional expression. Seng et al., in the paper "A combined rule-based & machine learning audio-visual emotion recognition approach", split audio-visual emotion recognition into two mutually independent paths that extract features separately and are modelled on their respective classifiers; the corresponding recognition rates are obtained, and the final recognition rate is then obtained from ratio scores and the corresponding weight assignment. Existing decision-level fusion has two main shortcomings. First, the ratio scoring and weight assignment strategies lack a unified, authoritative standard, so different researchers often obtain different recognition results on the same research task with a variety of ratio scores and weight assignment strategies. Second, decision-level fusion emphasizes the fusion of the face recognition and speech recognition results and ignores the inner link between facial features and speech features.
CN106529504A discloses a bimodal video emotion recognition method with compound spatio-temporal features: the existing local binary pattern algorithm is extended to a spatio-temporal ternary pattern to obtain spatio-temporal local ternary pattern texture features of facial expression and upper-body posture; three-dimensional gradient orientation histogram features are further fused to enhance the description of the emotion video, and the two kinds of features are combined into a compound spatio-temporal feature. When the upper-body posture of the person in the video changes quickly or upper-body posture frames are missing, the realization of the algorithm is affected, so this bimodal video emotion recognition method combining facial expression and upper-body posture has limitations in feature extraction.
CN105512609A discloses a multi-modal fusion video emotion recognition method based on a kernel extreme learning machine: feature extraction and feature selection are performed on the image and audio information of a video to obtain video features; the acquired multi-channel EEG signals are preprocessed and subjected to feature extraction and feature selection to obtain EEG features; a multi-modal fusion video emotion recognition model based on the kernel extreme learning machine is built, and the video features and EEG features are input into it to obtain the final classification accuracy. However, this algorithm only achieves a high recognition rate for three classes of video emotion data, so its applicability is limited.
CN103400145B discloses a voice-vision fusion emotion recognition method based on a cue neural network. The method first trains an independent neural network for each of three channels, namely the user's frontal facial expression, profile facial expression and voice, to recognize discrete emotion categories; 4 cue nodes are added to the output layer of each neural network during training, each carrying the cue information of one of 4 coarse-grained categories in the arousal-valence space, and a multi-modal fusion model, also a neural network trained with cue information, then fuses the outputs of the three networks. However, since profile facial expression frames are scarce in most videos and difficult to acquire effectively, the method has large limitations in practice. The method also involves the training and fusion of neural networks, so as the data volume and data dimensionality grow, the consumption of training time and resources gradually increases and the error rate also rises.
CN105138991B discloses a video emotion recognition method based on the fusion of emotion-salient features. Audio features and visual emotion features are extracted for each video shot in the training set; an emotion distribution histogram feature is constructed from the audio features with a bag-of-words model, and an emotion attention feature is constructed from the visual emotion features with a visual dictionary; the emotion attention feature and the emotion distribution histogram feature are fused top-down to form emotion-salient video features. When extracting the visual emotion features, this method only extracts features from video key frames and, to a certain extent, ignores the feature relations between video frames.
Summary of the invention
The technical problem to be solved by the present invention is to provide a video emotion recognition method that fuses facial expression recognition and speech emotion recognition. It is an audio-visual emotion recognition method based on two progressive decision-level stages: the facial expression recognition and the speech emotion recognition in a video are separated, video emotion is recognized in two progressive stages, and speech emotion recognition is carried out on the basis of facial expression recognition by computing a conditional probability. The present invention overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and has a low recognition rate.
The technical solution adopted by the present invention to solve this technical problem is as follows: the video emotion recognition method fusing facial expression recognition and speech emotion recognition is an audio-visual emotion recognition method based on two progressive decision-level stages, with the following specific steps:
Process A. Facial image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions and the first classification of facial expression recognition, with the following steps:
First step: frame extraction and speech extraction from the video signal:
The videos in the database are decomposed into image frame sequences; video frame extraction is performed with the open-source FormatFactory software, and the speech signal in each video is extracted and saved in MP3 format;
Second step: preprocessing of the image frame sequence and the speech signal:
For the image frame sequence obtained in the first step, the face is located and cropped with the published Viola&Jones algorithm, and the cropped face image is normalized to M × M pixels, yielding the size-normalized face image frame sequence;
For the speech signal obtained in the first step, speech detection is performed with the well-known voice activity detection algorithm VAD to remove noise and silent segments, yielding a speech signal from which features are easier to extract;
The preprocessing of the image frame sequence and the speech signal is thus completed;
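For illustration only, the sketch below shows how this preprocessing could be implemented. The use of OpenCV's bundled Haar cascade as a stand-in for the Viola&Jones detector, the simple energy-threshold VAD standing in for the unspecified "well known" VAD, and the value M = 128 are all assumptions and not details given in the patent.

```python
import cv2
import numpy as np

M = 128  # assumed normalization size; the patent only specifies "M x M pixels"

def detect_and_normalize_face(frame_bgr):
    """Locate the largest face with a Viola-Jones style cascade and resize it to M x M."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (M, M))

def simple_energy_vad(signal, frame_len=400, threshold=1e-4):
    """Placeholder VAD: keep only frames whose short-time energy exceeds a threshold."""
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    voiced = [f for f in frames if np.mean(f.astype(float) ** 2) > threshold]
    return np.concatenate(voiced) if voiced else signal
```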
Third step: marking facial feature points on the image frame sequence and screening the key frame of the image frame sequence:
T facial feature points are marked on the size-normalized image frame sequence of the second step, where T takes values in 1, 2, ..., 68; the positions of the 68 feature points are well known, and the marked feature points outline the eye, eyebrow, nose and mouth regions of the face image. According to the coordinates of the T feature points, the following 6 specific distances are computed for the u-th frame of the size-normalized image frame sequence of the second step:
the vertical distance between the eyes and the eyebrows, D_{u,1} = d_vertical||p_22, p_40||,
the vertical distance of the eye opening, D_{u,2} = d_vertical||p_45, p_47||,
the vertical distance between the eyes and the mouth, D_{u,3} = d_vertical||p_37, p_49||,
the vertical distance between the nose and the mouth, D_{u,4} = d_vertical||p_34, p_52||,
the vertical distance between the upper and lower lips, D_{u,5} = d_vertical||p_52, p_58||,
the horizontal width between the two corners of the mouth, D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} - p_{i,y}|,  d_horizontal||p_i, p_j|| = |p_{j,x} - p_{i,x}|   (1),
In formula (1), p_i is the coordinate pair of the i-th feature point, p_j is the coordinate pair of the j-th feature point, p_{i,y} and p_{j,y} are the ordinates of the i-th and j-th feature points, p_{i,x} and p_{j,x} are their abscissas, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, and i = 1, 2, ..., 68, j = 1, 2, ..., 68;
If the first frame of the size-normalized image frame sequence of the second step is taken as the neutral frame, the set V_0 of its 6 specific distances is as shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame of the size-normalized image frame sequence of the second step;
The set V_u of the 6 specific distances of the u-th frame of the size-normalized image frame sequence of the second step is as shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
In formula (3), u = 1, 2, ..., K-1, where K is the number of face images in one size-normalized image frame sequence of the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances of the u-th frame of that sequence;
The sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame of the size-normalized image frame sequence of the second step is given by formula (4),
In formula (4), DF_u represents the sum of the ratios of the 6 corresponding specific distances of the neutral frame and the u-th frame, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame of the size-normalized image frame sequence of the second step;
In the size-normalized image frame sequence of the second step, the ratio DF of the specific distances of every frame of the sequence is obtained according to formulas (2), (3) and (4), and the image frame with the largest DF is screened out as the key frame of the image frame sequence,
The marking of facial feature points on the image frame sequence and the screening of the key frame of the image frame sequence are thus completed;
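A minimal sketch of this key-frame screening is given below. The 0-based landmark indexing and the assumed form DF_u = sum over n of D_{u,n}/D_{0,n} for formula (4), whose body is not reproduced in the text above, are assumptions rather than details taken from the patent.

```python
import numpy as np

# (i, j, direction) for the six specific distances D_{u,1..6}; 1-based landmark numbers as in the text
DISTANCE_PAIRS = [(22, 40, "v"), (45, 47, "v"), (37, 49, "v"),
                  (34, 52, "v"), (52, 58, "v"), (49, 55, "h")]

def specific_distances(landmarks):
    """landmarks: (68, 2) array of (x, y) coordinates for one frame."""
    dists = []
    for i, j, direction in DISTANCE_PAIRS:
        pi, pj = landmarks[i - 1], landmarks[j - 1]
        dists.append(abs(pj[1] - pi[1]) if direction == "v" else abs(pj[0] - pi[0]))
    return np.array(dists, dtype=float)

def select_key_frame(landmark_seq):
    """landmark_seq: list of (68, 2) arrays; frame 0 is taken as the neutral frame."""
    v0 = specific_distances(landmark_seq[0])
    # assumed reading of formula (4): sum of per-distance ratios against the neutral frame
    scores = [np.sum(specific_distances(lm) / v0) for lm in landmark_seq[1:]]
    return int(np.argmax(scores)) + 1  # index of the key frame in the sequence
```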
Fourth step: extraction of face texture features:
Face texture features are extracted with the LBP-TOP algorithm. First, the size-normalized image frame sequence of the second step is divided in space-time into the three orthogonal planes XY, XT and YT; in each orthogonal plane the LBP value of the central pixel of every 3 × 3 neighbourhood is computed; the LBP histogram of each of the three orthogonal planes is accumulated; and finally the LBP histograms of the three orthogonal planes are concatenated to form the overall feature vector. The LBP operator is computed as shown in formulas (5) and (6),
In formulas (5) and (6), Z is the number of points in the neighbourhood of the central pixel, R is the distance from the neighbourhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighbourhood point, and Sig(t_q - t_c) is the LBP code value of the q-th neighbourhood point,
The LBP-TOP histogram is defined as shown in formula (7),
In formula (7), b is the plane index (b = 0 is the XY plane, b = 1 is the XT plane, b = 2 is the YT plane), n_b is the number of binary patterns produced by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels of the b-th plane whose LBP code value is a when features are extracted with the LBP_{Z,R} operator;
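Formulas (5) to (7) are not reproduced in the text above. For reference, the standard LBP and LBP-TOP histogram definitions that the variable descriptions here are consistent with are given below; this is a reconstruction of the standard forms, not a verbatim copy of the patent's formulas.

```latex
% Standard LBP / LBP-TOP definitions, reconstructed from the variable descriptions in the text
\begin{align}
LBP_{Z,R} &= \sum_{q=0}^{Z-1} \mathrm{Sig}(t_q - t_c)\,2^{q} && (5) \\
\mathrm{Sig}(x) &= \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases} && (6) \\
H_{a,b} &= \sum_{x,y,t} I\{\, LBP_{Z,R,b}(x,y,t) = a \,\}, \quad a = 0,\dots,n_b - 1,\; b \in \{0,1,2\} && (7)
\end{align}
```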
Fifth step: extraction of face geometric features:
According to the key frame screened from the image frame sequence in the third step, the coordinates of the T feature points marked in the key frame are used to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the T-shaped region of the face, which mainly comprises the eyebrows, eyes, nose, chin and mouth; the face geometric feature extraction method therefore mainly extracts distance features between the marked points of the T-shaped face region;
Step 5.1: computing the Euclidean distance features of facial feature point pairs:
From the T feature points of the key frame screened in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points (50 pairs of feature points in total) are selected, and the Euclidean distance between each feature point pair A and B is computed, giving a 50-dimensional Euclidean distance feature denoted G50. The Euclidean distance between feature points A and B is computed with formula (8),
In formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
Step 5.2: computing the angle features of facial feature points:
From the T feature points of the key frame screened in the third step, 10 angles characterizing the changes of facial features are selected, of which 2 eyebrow angles, 6 eye angles and 2 mouth angles are computed, and the angle features are extracted, giving a 10-dimensional angle feature denoted Q10. The angle of a feature point is computed with formula (9),
In formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming an angle in the eyebrow, eye and mouth regions marked in the third step, where p_D is the coordinate pair of the corner point;
Step 5.3: computing the face region area features:
5 regions of the face image are selected, namely the left and right eyebrows, the two eyes and the mouth, and the area of each of these 5 regions is computed separately. Because face sizes differ from person to person, the areas of the 5 face regions extracted from the key frame and the corresponding areas of the 5 face regions extracted from the neutral frame are subtracted from one another, giving the change features of the face region areas, 5 dimensions in total, denoted O5. The eyebrow, mouth and eye regions of the face are taken as triangles, and each triangle area is computed with Heron's formula. The Euclidean distance feature G50 of the facial feature point pairs, the angle feature Q10 of the facial feature points and the face region area feature O5 are combined into the geometric feature F of the face, as shown in formula (10),
F = [G50 Q10 O5]   (10),
At this point, the face texture features and the face geometric features are concatenated, and the extraction of facial expression features is complete;
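The following sketch illustrates how the three geometric feature groups of the fifth step could be computed. The concrete landmark pairs and triples are left as caller-supplied placeholders, since the full selections are only listed in the embodiment's tables; the law-of-cosines angle computation is an assumption consistent with formula (9)'s description.

```python
import numpy as np

def euclidean(pa, pb):                          # formula (8): Euclidean distance of a point pair
    return float(np.hypot(pb[0] - pa[0], pb[1] - pa[1]))

def angle_at(pc, pd, pe):                       # formula (9): angle at the corner point p_D
    v1, v2 = np.asarray(pc) - np.asarray(pd), np.asarray(pe) - np.asarray(pd)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def triangle_area(pa, pb, pc):                  # Heron's formula for the region areas
    a, b, c = euclidean(pb, pc), euclidean(pa, pc), euclidean(pa, pb)
    s = (a + b + c) / 2.0
    return float(np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0)))

def geometric_feature(landmarks, neutral_landmarks, dist_pairs, angle_triples, area_triples):
    """Build F = [G50 Q10 O5] of formula (10) from landmark index pairs/triples."""
    g50 = [euclidean(landmarks[i], landmarks[j]) for i, j in dist_pairs]
    q10 = [angle_at(landmarks[c], landmarks[d], landmarks[e]) for c, d, e in angle_triples]
    o5 = [triangle_area(*[landmarks[k] for k in tri]) -
          triangle_area(*[neutral_landmarks[k] for k in tri]) for tri in area_triples]
    return np.concatenate([g50, q10, o5])
```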
Sixth step: grouping of facial expressions:
The six facial emotions, namely surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups as follows:
First group: surprise, fear; second group: anger, disgust; third group: happiness, sadness;
Seventh step: first classification by facial expression recognition:
The facial expression features extracted in the fourth and fifth steps are fed into an ELM classifier for training and testing, which completes the first classification of facial expression recognition and yields its recognition result. The ELM parameters are set as follows: ELM type: "classification", number of hidden-layer neurons: "20", activation function: "Sigmoid";
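A minimal extreme learning machine sketch with the stated settings (20 hidden neurons, sigmoid activation) is shown below; the random initialization and the pseudo-inverse solution of the output weights are standard ELM practice and are assumptions rather than details taken from the patent.

```python
import numpy as np

class SimpleELM:
    """Single-hidden-layer ELM: random input weights, sigmoid activation, pseudo-inverse output weights."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))    # sigmoid activation

    def fit(self, X, y_onehot):
        """X: (n_samples, n_features); y_onehot: (n_samples, 3) group labels."""
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ y_onehot                # closed-form output weights
        return self

    def predict(self, X):
        return np.argmax(self._hidden(X) @ self.beta, axis=1)   # predicted group index (0, 1, 2)
```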
Process B. Speech emotion recognition as the second classification:
On the basis of the facial expression recognition result of process A, and combined with the speech features, process B performs speech emotion feature extraction and the second classification by speech emotion recognition for each of the three groups of the facial expression grouping of the sixth step. The specific operations are as follows:
Eighth step: extraction of speech emotion features:
For the classification result of the first classification of facial expression recognition in the seventh step, and according to the grouping of the sixth step, different prosodic features are extracted for each group according to the different sensitivities of its emotions to different audio prosodic features:
First group: the zero-crossing rate ZCR and the logarithmic energy LogE are extracted,
Second group: the Teager energy operator TEO, the zero-crossing rate ZCR and the logarithmic energy LogE are extracted,
Third group: the pitch Pitch, the zero-crossing rate ZCR and the Teager energy operator TEO are extracted,
Among these prosodic features, the pitch Pitch is computed in the frequency domain,
For the speech signal M preprocessed in the second step, the pitch Pitch is computed with formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform, L_M represents the length of the speech signal, and the Hamming-windowed speech signal is computed as shown in formula (12),
In formula (12), N is the number of Hamming windows and m is the index of the m-th Hamming window;
The zero-crossing rate ZCR among the prosodic features is computed as shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | · | is the absolute value, X(m) is the speech signal of the m-th window after framing and windowing, and the function sgn{X(m)} judges the sign of the speech amplitude; sgn{X(m)} is computed by formula (14),
In formula (14), X(m) is the speech signal of the m-th window after framing and windowing;
The logarithmic energy LogE is computed with formula (15),
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm, X''(m) = d²X(m)/dm², and for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency and θ is the initial phase angle of the signal,
For each of the three groups of the facial expression grouping of the sixth step, the well-known mel-frequency cepstral coefficients MFCC and their first-order and second-order difference features are extracted from the audio files corresponding to the image frames of the group; finally, the prosodic features extracted for each group are concatenated with the corresponding mel-frequency cepstral coefficients MFCC and their first-order and second-order difference features to form a mixed audio feature,
The extraction of speech emotion features is thus completed;
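The sketch below illustrates one way the mixed audio feature of this step could be assembled. The framing parameters, the crude spectral-peak pitch estimate standing in for formulas (11) and (12), and the use of librosa for MFCC and its deltas are assumptions; the patent does not name a library or concrete window settings.

```python
import numpy as np
import librosa

FRAME_LEN, HOP = 400, 160  # assumed framing parameters

def frame_signal(x):
    frames = [x[i:i + FRAME_LEN] for i in range(0, len(x) - FRAME_LEN, HOP)]
    return np.stack(frames) * np.hamming(FRAME_LEN)          # Hamming-windowed frames

def zcr(frames):                     # formula (13): average zero-crossing rate over the windows
    return float(np.mean(np.abs(np.diff(np.sign(frames), axis=1)) / 2.0))

def log_energy(frames):              # formula (15): total log energy of the windows
    return float(np.sum(np.log(np.sum(frames ** 2, axis=1) + 1e-10)))

def teager(frames):                  # formula (16), discrete form: x(m)^2 - x(m-1) x(m+1)
    return float(np.mean(frames[:, 1:-1] ** 2 - frames[:, :-2] * frames[:, 2:]))

def crude_pitch(x, sr):
    # crude spectral-peak stand-in for the frequency-domain pitch of formulas (11)-(12)
    spectrum = np.abs(np.fft.rfft(x * np.hamming(len(x))))
    return float(sr * np.argmax(spectrum) / len(x))

def mixed_audio_feature(x, sr, group):
    """group is 1, 2 or 3, following the facial-expression grouping of the sixth step."""
    frames = frame_signal(x)
    prosody = {1: [zcr(frames), log_energy(frames)],
               2: [teager(frames), zcr(frames), log_energy(frames)],
               3: [crude_pitch(x, sr), zcr(frames), teager(frames)]}[group]
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([prosody, mfcc.mean(axis=1), d1.mean(axis=1), d2.mean(axis=1)])
```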
Ninth step: second classification by speech emotion recognition:
The speech emotion features extracted in the eighth step are fed into an SVM for training and testing, and the speech emotion recognition rate is finally obtained. The SVM parameters are set as follows: penalty coefficient: "95", allowed redundant output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
The second classification of speech emotion recognition is thus completed;
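A sketch of this second classification is shown below; interpreting "penalty coefficient 95" as C = 95 and "kernel parameter 1" as gamma = 1 in scikit-learn's SVC is an assumption about how the stated settings map onto a concrete library.

```python
from sklearn.svm import SVC

def train_speech_classifier(audio_features, labels):
    """Train the within-group two-class SVM (Gaussian/RBF kernel) on the mixed audio features."""
    clf = SVC(C=95, kernel="rbf", gamma=1.0)
    clf.fit(audio_features, labels)   # binary labels: the two emotions inside one group
    return clf
```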
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step: decision-level fusion of facial expression recognition and speech emotion recognition:
Since the speech emotion recognition of process B is a second recognition carried out on the basis of the facial emotion recognition of process A, the relation between the two recognition rates is that of a conditional probability, and the final recognition rate P(Audio_Visual) is computed as shown in formula (17),
P(Audio_Visual) = P(Visual) × P(Audio | Visual)   (17),
In formula (17), P(Visual) is the recognition rate of the first classification on face images, and P(Audio | Visual) is the recognition rate of the second classification on speech emotion;
This completes the video emotion recognition based on two-stage progressive decision-level fusion of facial expression recognition and speech emotion recognition.
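The fusion of formula (17) amounts to multiplying the two stage-wise recognition rates, as in the following sketch; the example numbers are illustrative only.

```python
def fused_recognition_rate(p_visual, p_audio_given_visual):
    """Formula (17): P(Audio_Visual) = P(Visual) * P(Audio | Visual)."""
    return p_visual * p_audio_given_visual

# e.g. a 90% correct first-stage group assignment followed by an 85% within-group
# speech decision gives an overall rate of 0.90 * 0.85 = 0.765
```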
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the coordinates of T feature points are used in the third step, where T = 68.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the voice activity detection algorithm is Voice Activity Detection, abbreviated VAD; the zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the logarithmic energy is LogEnergy, abbreviated LogE; the mel-frequency cepstral coefficients are Mel-frequency cepstral coefficients, abbreviated MFCC; the Teager energy operator is Teager Energy Operator, abbreviated TEO. The voice activity detection algorithm, zero-crossing rate, logarithmic energy, mel-frequency cepstral coefficients and Teager energy operator used here are all well known in the art.
In the above video emotion recognition method fusing facial expression recognition and speech emotion recognition, the computational operations involved can be understood by those skilled in the art.
The beneficial effects of the present invention are as follows. Compared with the prior art, the outstanding substantive features and significant improvements of the present invention are:
(1) The present invention provides a video emotion recognition method fusing facial expression recognition and speech emotion recognition; it is an audio-visual emotion recognition method based on decision-level fusion. The method separates the facial expression recognition and the speech emotion recognition in a video, performs emotion recognition in two progressive stages and, by computing a conditional probability, carries out speech emotion recognition on the basis of facial expression recognition. The influence of the facial expression recognition result on the speech emotion recognition is fully taken into account, so that facial expression recognition and speech emotion recognition are fused more closely, the two assist each other, and a better human emotion recognition effect is achieved. This overcomes the defects of the prior art, which ignores the inner link between facial features and speech features in human emotion recognition and whose video emotion recognition is slow and has a low recognition rate.
(2) Different emotions have different sensitivities to different prosodic features. In 2014, Chien et al. performed an "acoustic characteristic analysis" experiment in the paper "A new approach of audio emotion recognition", which demonstrated that the 6 emotions have different sensitivities to the prosodic features Pitch, Zero-Crossing Rate, LogEnergy and Teager Energy Operator. That paper classified the mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC) extracted from the audio with an SVM classifier; the recognition rate decreased progressively when going from two-class to four-class to six-class classification, so the fewer the classes, the better the classification effect of the classifier. Therefore, in the present invention a three-class classification is chosen for the first facial expression classification and a two-class classification is chosen for the second audio classification. The method of the present invention reduces a multi-class problem to a three-class problem and a two-class problem, which both reduces the feature dimensionality and shortens the training time, greatly improving the efficiency of the algorithm.
(3) Compared with CN106529504A, the method of the present invention not only extracts the facial features in the video but also the audio features; the bimodal combination of facial features and audio features benefits a more accurate recognition of the emotion of the person in the video.
(4) Compared with CN105512609A, the method proposed in CN105512609A can only recognize three emotions in a video, whereas the present invention can recognize six emotions, and the average recognition rate of the present invention is 9.92% higher than the video emotion recognition rate in CN105512609A.
(5) Compared with CN105138991A, the method of the present invention classifies facial features and speech features separately, avoiding the "curse of dimensionality" that feature-level fusion easily causes; the decision-level fusion is simple to compute and faster to train and recognize.
(6) When extracting audio features, the present invention takes into account that different emotions have different sensitivities to different audio features, so different audio features are extracted for each group, which benefits the second classification based on speech features.
(7) The present invention extracts texture, geometric, temporal and prosodic features; different features reflect different characteristics of an expression, so the classifiers can be trained better and video emotion recognition is carried out from multiple modalities.
(8) The present invention uses a two-stage progressive emotion classification method with face recognition as the basis and speech recognition as the supplement; the two complement and assist each other, so that more accurate video emotion recognition can be achieved.
Detailed description of the invention
The present invention is further described below with reference to the drawings and embodiments.
Fig. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the 6 specific distances and of the marking of the 68 facial feature points.
Fig. 3 is an example of the marking of the 68 facial feature points on a face from the eNTERFACE'05 database.
Specific embodiment
The embodiment shown in Fig. 1 illustrates the flow of the method of the present invention: Process A, facial image expression recognition as the first classification → frame extraction and speech extraction from the video signal → preprocessing of the image frame sequence and the speech signal → marking of facial feature points on the image frame sequence and screening of the key frame of the image frame sequence → extraction of face texture features → extraction of face geometric features → grouping of facial expressions → first classification by facial expression recognition; Process B, speech emotion recognition as the second classification → extraction of speech emotion features → second classification by speech emotion recognition; Process C, fusion of facial expression recognition and speech emotion recognition → decision-level fusion of facial expression recognition and speech emotion recognition → completion of the video emotion recognition based on two-stage progressive decision-level fusion of facial expression recognition and speech emotion recognition.
The embodiment shown in Fig. 2 illustrates the marking of the 6 specific distances and of the 68 facial feature points; it is an example image marked with the feature points. The 6 specific distances are, in order: the vertical distance between feature points 22 and 40, denoted D_{u,1}; the vertical distance between feature points 45 and 47, denoted D_{u,2}; the vertical distance between feature points 37 and 49, denoted D_{u,3}; the vertical distance between feature points 34 and 52, denoted D_{u,4}; the vertical distance between feature points 52 and 58, denoted D_{u,5}; and the horizontal distance between feature points 49 and 55, denoted D_{u,6}. The lines between the feature points in the figure outline the eyebrow, eye and mouth regions of the face.
The embodiment shown in Fig. 3 is an example of the facial feature point marking performed by the present invention with Dlib on a face from the eNTERFACE'05 database; the 68 feature points marked in the figure correspond to the marking of the 68 feature points in the schematic diagram of Fig. 2.
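For illustration, the 68-point marking shown in Fig. 3 could be produced with Dlib as sketched below; the model file name is the commonly distributed Dlib predictor and is an assumption, since the patent names the library but not a model file.

```python
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file

def landmarks_68(gray_image):
    """Return the 68 facial landmark coordinates of the first detected face, or None."""
    faces = detector(gray_image, 1)
    if not faces:
        return None
    shape = predictor(gray_image, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)])
```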
Embodiment 1
The video emotion recognition method fusing facial expression recognition and speech emotion recognition of this embodiment is an audio-visual emotion recognition method based on two progressive decision-level stages, with the following specific steps:
Process A. Facial image expression recognition as the first classification:
Process A comprises the extraction of facial expression features, the grouping of facial expressions and the first classification of facial expression recognition, with the following steps:
First step: frame extraction and speech extraction from the video signal:
The videos in the database are decomposed into image frame sequences; video frame extraction is performed with the open-source FormatFactory software, and the speech signal in each video is extracted and saved in MP3 format;
Second step: preprocessing of the image frame sequence and the speech signal:
For the image frame sequence obtained in the first step, the face is located and cropped with the published Viola&Jones algorithm, and the cropped face image is normalized to M × M pixels, yielding the size-normalized face image frame sequence;
For the speech signal obtained in the first step, speech detection is performed with the well-known voice activity detection algorithm VAD to remove noise and silent segments, yielding a speech signal from which features are easier to extract;
The preprocessing of the image frame sequence and the speech signal is thus completed;
Third step: marking facial feature points on the image frame sequence and screening the key frame of the image frame sequence:
T facial feature points are marked on the size-normalized image frame sequence of the second step, where T takes values in 1, 2, ..., 68; the positions of the 68 feature points are well known, and the marked feature points outline the eye, eyebrow, nose and mouth regions of the face image. According to the coordinates of the T = 68 feature points used in the present embodiment, the following 6 specific distances are computed for the u-th frame of the size-normalized image frame sequence of the second step:
the vertical distance between the eyes and the eyebrows, D_{u,1} = d_vertical||p_22, p_40||,
the vertical distance of the eye opening, D_{u,2} = d_vertical||p_45, p_47||,
the vertical distance between the eyes and the mouth, D_{u,3} = d_vertical||p_37, p_49||,
the vertical distance between the nose and the mouth, D_{u,4} = d_vertical||p_34, p_52||,
the vertical distance between the upper and lower lips, D_{u,5} = d_vertical||p_52, p_58||,
the horizontal width between the two corners of the mouth, D_{u,6} = d_horizontal||p_49, p_55||,
where
d_vertical||p_i, p_j|| = |p_{j,y} - p_{i,y}|,  d_horizontal||p_i, p_j|| = |p_{j,x} - p_{i,x}|   (1),
In formula (1), p_i is the coordinate pair of the i-th feature point, p_j is the coordinate pair of the j-th feature point, p_{i,y} and p_{j,y} are the ordinates of the i-th and j-th feature points, p_{i,x} and p_{j,x} are their abscissas, d_vertical||p_i, p_j|| is the vertical distance between feature points i and j, d_horizontal||p_i, p_j|| is the horizontal distance between feature points i and j, and i = 1, 2, ..., 68, j = 1, 2, ..., 68;
If the first frame of the size-normalized image frame sequence of the second step is taken as the neutral frame, the set V_0 of its 6 specific distances is as shown in formula (2),
V_0 = [D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5}, D_{0,6}]   (2),
In formula (2), D_{0,1}, D_{0,2}, D_{0,3}, D_{0,4}, D_{0,5} and D_{0,6} are the 6 specific distances of the neutral frame of the size-normalized image frame sequence of the second step;
The set V_u of the 6 specific distances of the u-th frame of the size-normalized image frame sequence of the second step is as shown in formula (3),
V_u = [D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6}]   (3),
In formula (3), u = 1, 2, ..., K-1, where K is the number of face images in one size-normalized image frame sequence of the second step, and D_{u,1}, D_{u,2}, D_{u,3}, D_{u,4}, D_{u,5}, D_{u,6} are the 6 specific distances of the u-th frame of that sequence;
The sum of the ratios of the 6 corresponding specific distances of the u-th frame and the neutral frame of the size-normalized image frame sequence of the second step is given by formula (4),
In formula (4), DF_u represents the sum of the ratios of the 6 corresponding specific distances of the neutral frame and the u-th frame, n indexes the 6 specific distances, D_{0,n} is the n-th specific distance of the neutral frame, and D_{u,n} is the n-th specific distance of the u-th frame of the size-normalized image frame sequence of the second step;
In the size-normalized image frame sequence of the second step, the ratio DF of the specific distances of every frame of the sequence is obtained according to formulas (2), (3) and (4), and the image frame with the largest DF is screened out as the key frame of the image frame sequence,
The marking of facial feature points on the image frame sequence and the screening of the key frame of the image frame sequence are thus completed;
Fourth step: extraction of face texture features:
Face texture features are extracted with the LBP-TOP algorithm. First, the size-normalized image frame sequence of the second step is divided in space-time into the three orthogonal planes XY, XT and YT; in each orthogonal plane the LBP value of the central pixel of every 3 × 3 neighbourhood is computed; the LBP histogram of each of the three orthogonal planes is accumulated; and finally the LBP histograms of the three orthogonal planes are concatenated to form the overall feature vector. The LBP operator is computed as shown in formulas (5) and (6),
In formulas (5) and (6), Z is the number of points in the neighbourhood of the central pixel, R is the distance from the neighbourhood points to the central pixel, t_c is the pixel value of the central pixel, t_q is the pixel value of the q-th neighbourhood point, and Sig(t_q - t_c) is the LBP code value of the q-th neighbourhood point,
The LBP-TOP histogram is defined as shown in formula (7),
In formula (7), b is the plane index (b = 0 is the XY plane, b = 1 is the XT plane, b = 2 is the YT plane), n_b is the number of binary patterns produced by the LBP operator in the b-th plane, and I{LBP_{Z,R,b}(x, y, t) = a} counts the pixels of the b-th plane whose LBP code value is a when features are extracted with the LBP_{Z,R} operator;
Fifth step: extraction of face geometric features:
According to the key frame screened from the image frame sequence in the third step, the coordinates of the T feature points marked in the key frame are used to obtain the geometric features of the facial expression. In the field of facial expression recognition, the richest facial feature region is the T-shaped region of the face, which mainly comprises the eyebrows, eyes, nose, chin and mouth (the specific feature points involved are listed in Tables 1 to 3); the face geometric feature extraction method therefore mainly extracts distance features between the marked points of the T-shaped face region;
Step 5.1: computing the Euclidean distance features of facial feature point pairs:
From the T feature points of the key frame screened in the third step, 14 pairs of eyebrow feature points, 12 pairs of eye feature points, 12 pairs of mouth feature points, 6 pairs of nose feature points and 6 pairs of chin feature points (50 pairs of feature points in total) are selected, and the Euclidean distance between each feature point pair A and B is computed, giving a 50-dimensional Euclidean distance feature denoted G50. The Euclidean distance between feature points A and B is computed with formula (8),
In formula (8), p_A is the coordinate pair of feature point A, p_B is the coordinate pair of feature point B, p_{A,x} is the abscissa of feature point A, p_{A,y} is the ordinate of feature point A, p_{B,x} is the abscissa of feature point B, and p_{B,y} is the ordinate of feature point B;
Table 1 lists the facial feature point pairs of the T-shaped face region whose distances are computed, where d||p_A, p_B|| denotes the Euclidean distance between the feature point pair A and B;
Table 1
Step 5.2: computing the angle features of facial feature points:
From the T = 68 feature points of the key frame screened in the third step, 10 angles characterizing the changes of facial features are selected, of which 2 eyebrow angles, 6 eye angles and 2 mouth angles are computed, and the angle features are extracted, giving a 10-dimensional angle feature denoted Q10; the specific angles are listed in Table 2. The angle of a feature point is computed with formula (9),
In formula (9), p_C, p_D and p_E are the coordinate pairs of the three feature points forming an angle in the eyebrow, eye and mouth regions marked in the third step, where p_D is the coordinate pair of the corner point;
Table 2 lists the facial feature point angles of the T-shaped face region that are computed, where Q(p_C, p_D, p_E) denotes the angle feature of the angle at D;
Table 2
Step 5.3: computing the face region area features:
5 regions of the face image are selected, namely the left and right eyebrows, the two eyes and the mouth, and the area of each of these 5 regions is computed separately; the specific regions are listed in Table 3;
Table 3
Table 3 lists the feature point regions of the T-shaped face region whose areas are computed, where O(p_A, p_B, p_C, p_D) denotes the area of the region enclosed by the lines connecting feature points A, B, C and D;
Because face sizes differ from person to person, the areas of the 5 face regions of Table 3 extracted from the key frame and the corresponding areas of the 5 face regions extracted from the neutral frame are subtracted from one another, giving the change features of the face region areas, 5 dimensions in total, denoted O5. The eyebrow, mouth and eye regions of the face are taken as triangles, and each triangle area is computed with Heron's formula. The Euclidean distance feature G50 of the facial feature point pairs, the angle feature Q10 of the facial feature points and the face region area feature O5 are combined into the geometric feature F of the face, as shown in formula (10),
F = [G50 Q10 O5]   (10),
At this point, the face texture features and the face geometric features are concatenated, and the extraction of facial expression features is complete;
Sixth step: grouping of facial expressions:
The six facial emotions, namely surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups as follows:
First group: surprise, fear; second group: anger, disgust; third group: happiness, sadness;
Seventh step: first classification by facial expression recognition:
The facial expression features extracted in the fourth and fifth steps are fed into an ELM classifier for training and testing, which completes the first classification of facial expression recognition and yields its recognition result. The ELM parameters are set as follows: ELM type: "classification", number of hidden-layer neurons: "20", activation function: "Sigmoid";
Process B. is using speech emotion recognition as second of Classification and Identification:
Process B is on the basis of the facial expression recognition result of process A, in conjunction with phonetic feature, respectively to above-mentioned The of each group of carry out speech emotional feature extraction of three groups in the grouping of the 6th step human face expression and speech emotion recognition Secondary classification, concrete operations are as follows:
8th step, the extraction of speech emotional feature:
For the classification results of the first subseries of above-mentioned 7th step facial expression recognition, according to the grouping of the 6th step, often One group of emotion extracts different prosodic features to the difference of the sensitivity of different audio prosodic features respectively:
First group: zero-crossing rate ZCR and logarithmic energy LogE is extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR, logarithmic energy LogE are extracted,
Third group: extracting pitch Pitch, zero-crossing rate ZCR, Teager energy operator TEO,
Pitch Pitch is to calculate in a frequency domain in above-mentioned prosodic features,
Voice signal M pretreated for above-mentioned second step calculates pitch Pitch with following formula (11),
In formula (11), Pitch is pitch, and DFT is discrete Fourier transform function, LMThe length of voice signal is represented,It represents voice signal and adds Hamming window,Be calculated as shown in following formula (12),
In formula (12), N is the quantity of Hamming window, and m is m-th of Hamming window;
Shown in the calculating such as formula (13) of zero-crossing rate ZCR in above-mentioned prosodic features,
In formula (13), ZCR indicates the Average zero-crossing rate of N number of window, | | be absolute value sign, X (m) be framing adding window it The voice signal of m-th of window afterwards, sgn { X (m) } function judge the positive and negative of speech amplitude, and sgn { X (m) } function is by formula (14) it calculates,
In formula (14), X (m) is the voice signal of m-th of window after framing adding window;
The calculation formula (15) of above-mentioned logarithmic energy LogE is as follows,
In formula (15), LogE indicates that total logarithmic energy of N number of window, X (m) are m-th of windows after framing adding window The voice signal of mouth, N is number of windows;
Teager energy operator TEO definition such as formula (16) is shown,
In formula (16), ψ [X (m)] is Teager energy operator TEO, X ' (m)=dX (the m)/dm, X " of m-th of window (m)=dX2(m)/dm2, for the signal of amplitude and frequency-invariant: X (m)=acos (φ m+ θ), wherein a is signal amplitude, φ For signal frequency, θ is signal initial phase angle,
For the audio file corresponding to the image frames of each of the three groups in the grouping of facial expressions of the above 6th step, the well-known Mel-frequency cepstral coefficients MFCC together with their first-order difference and second-order difference features are extracted; finally, the prosodic features of each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of the speech emotion features is completed;
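As an illustration of this 8th step, the sketch below computes frame-level zero-crossing rate, logarithmic energy, Teager energy and a coarse DFT-based pitch estimate, then concatenates the group-specific prosody with MFCC and its first- and second-order differences. It is only a sketch under stated assumptions: the frame and hop lengths, the peak-picking pitch estimate, the use of librosa and the averaging of features over time are choices of this example, not values fixed by the embodiment.

import numpy as np
import librosa

def prosodic_features(y, sr, frame_len=400, hop=160):
    # Frame the signal and compute ZCR, LogE, TEO and a rough spectral pitch per frame.
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop).T
    window = np.hamming(frame_len)
    zcr, log_e, teo, pitch = [], [], [], []
    for x in frames:
        zcr.append(np.mean(np.abs(np.diff(np.sign(x)))) / 2.0)        # zero-crossing rate
        log_e.append(np.log(np.sum(x ** 2) + 1e-10))                  # logarithmic energy
        teo.append(np.mean(x[1:-1] ** 2 - x[:-2] * x[2:]))            # Teager energy operator
        spec = np.abs(np.fft.rfft(x * window))
        pitch.append((np.argmax(spec[1:]) + 1) * sr / frame_len)      # coarse DFT-based pitch
    return np.array(zcr), np.array(log_e), np.array(teo), np.array(pitch)

def mixed_audio_feature(y, sr, group):
    # Concatenate the group-specific prosodic statistics with MFCC and its first- and
    # second-order differences, averaged over time.
    zcr, log_e, teo, pitch = prosodic_features(y, sr)
    prosody = {1: [zcr, log_e], 2: [teo, zcr, log_e], 3: [pitch, zcr, teo]}[group]
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    parts = [float(np.mean(p)) for p in prosody]
    parts += list(np.mean(np.vstack([mfcc, d1, d2]), axis=1))
    return np.array(parts)

The audio y and sampling rate sr would come from loading the MP3 file of the group's video segment, for example with librosa.load.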
9th step, the second classification of speech emotion recognition:
The speech emotion features extracted in the above 8th step are fed into an SVM for training and testing, and the recognition rate of speech emotion recognition is finally obtained, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
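A minimal scikit-learn sketch of this 9th step is given below; mapping the "penalty coefficient 95" to C=95 and the "kernel parameter 1" to gamma=1 of an RBF (Gaussian) kernel SVC is an assumption about the toolbox used, and the feature and label arrays are placeholders supplied by the caller.

import numpy as np
from sklearn.svm import SVC

def second_classification(train_feats, train_labels, test_feats, test_labels):
    # Train and test the group-specific SVM of the 9th step with a Gaussian kernel.
    svm = SVC(C=95, kernel="rbf", gamma=1.0)
    svm.fit(train_feats, train_labels)
    predicted = svm.predict(test_feats)
    return float(np.mean(predicted == np.asarray(test_labels)))   # speech emotion recognition rate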
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since speech emotion recognition is a secondary recognition carried out on the basis of the facial emotion recognition, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual) = P(Visual) × P(Audio | Visual)   (17),
In formula (17), P(Visual) is the recognition rate of the first-pass facial image recognition and P(Audio | Visual) is the recognition rate of the second-pass speech emotion recognition;
This completes the video emotion recognition that fuses facial expression recognition and speech emotion recognition through two progressive processes at the decision level.
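As a worked illustration of formula (17) with purely hypothetical numbers (not results reported by this embodiment): if the first-pass facial recognition rate were P(Visual) = 0.90 and the conditional speech recognition rate were P(Audio | Visual) = 0.85, the fused recognition rate would be P(Audio_Visual) = 0.90 × 0.85 = 0.765.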
This embodiment is compared with existing related techniques on the eNTERFACE'05 and RML databases; the specific recognition rates are shown in Table 4 below:
Table 4
The experimental results of Table 4 compare the recognition rates of recent audio-visual emotion recognition systems on the eNTERFACE'05 and RML databases: Mahdi Bejani et al. (2014), in "Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks", report an average recognition rate of 77.78% for audio-visual emotion recognition on the eNTERFACE'05 database;
Shiqing Zhang et al. (2016), in "Multimodal Deep Convolutional Neural Network for Audio-Visual Emotion Recognition", report an average recognition rate of 74.32% for audio-visual emotion recognition on the RML database;
Shiqing Zhang et al. (2017), in "Learning Affective Features with a Hybrid Deep Model for Audio-Visual Emotion Recognition", report average recognition rates of 85.97% and 80.36% on the eNTERFACE'05 and RML databases, respectively;
Yaxiong Ma et al. (2018), in "Audio-visual emotion fusion (AVEF): A deep efficient weighted approach", report average recognition rates of 84.56% and 81.98% for audio-visual emotion recognition on the eNTERFACE'05 and RML databases, respectively. Compared with these recent papers, the two-process progressive decision-level audio-visual emotion recognition method used in this embodiment achieves a distinct improvement in recognition rate.
In this embodiment, the voice activity detection algorithm is Voice Activity Detection, abbreviated VAD; logarithmic energy is Log Energy, abbreviated LogE; zero-crossing rate is Zero-Crossing Rate, abbreviated ZCR; the Teager energy operator is Teager Energy Operator, abbreviated TEO; Mel-frequency cepstral coefficients are abbreviated MFCC. The voice activity detection algorithm, logarithmic energy, zero-crossing rate, Teager energy operator and Mel-frequency cepstral coefficients used here are all well known in the art.
In this embodiment, the related calculation and operation methods are those that persons skilled in the art will appreciate.

Claims (2)

1. A video emotion recognition method that fuses facial expression recognition and speech emotion recognition, characterized in that it is an audio-visual emotion recognition method with two progressive processes at the decision level, with the following specific steps:
Process A. Facial image expression recognition as the first classification:
Process A includes the extraction of facial expression features, the grouping of facial expressions and the first classification of facial expression recognition, with the following steps:
First step, frame extraction from the video signal and extraction of the speech signal:
The videos in the database are decomposed into image frame sequences, frame extraction is carried out with the open-source FormatFactory software, and the speech signal in each video is extracted and saved in MP3 format;
Second step, preprocessing of the image frame sequence and the speech signal:
The image frame sequence obtained in the above first step is subjected to face localization and cropping using the publicly known Viola & Jones algorithm; the cropped facial images are normalized to a size of M × M pixels, giving an image frame sequence with normalized facial image size;
The speech signal obtained in the above first step is processed with the well-known voice activity detection algorithm VAD to remove noise and silent segments, giving a speech signal from which features are easier to extract;
Thus the preprocessing of the image frame sequence and the speech signal is completed;
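One possible realisation of this second step is sketched below in Python, assuming OpenCV's Haar-cascade implementation of the Viola & Jones detector and a simple energy-threshold stand-in for the voice activity detection; the normalized size of 128 × 128 pixels, the frame length and the energy threshold are assumptions of this sketch, since the claim only fixes M × M pixels and "a well-known VAD algorithm".

import cv2
import numpy as np

def crop_and_normalize_face(frame_bgr, size=128):
    # Detect the largest face with the Viola & Jones (Haar cascade) detector and resize it.
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (size, size))

def simple_vad(signal, sr, frame_ms=25, energy_ratio=0.1):
    # Keep frames whose short-time energy exceeds a fraction of the mean energy.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > energy_ratio * energy.mean()].reshape(-1)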
Third step, marking facial feature points on the image frame sequence and screening the key frame in the image frame sequence:
T facial feature points are marked on the image frame sequence with normalized facial image size from the above second step, where T takes values 1, 2, ..., 68 and the positions of the 68 feature points are well known; the feature points respectively outline the eye, eyebrow, nose and mouth regions on the facial image. According to the coordinates of the T feature points, the following 6 specific distances are calculated for the u-th frame image in the image frame sequence with normalized facial image size from the above second step:
Vertical distance between the eyes and the eyebrows Du,1: Du,1 = dvertical||p22, p40||,
Vertical distance of the eye opening Du,2: Du,2 = dvertical||p45, p47||,
Vertical distance between the eyes and the mouth Du,3: Du,3 = dvertical||p37, p49||,
Vertical distance between the nose and the mouth Du,4: Du,4 = dvertical||p34, p52||,
Vertical distance between the upper and lower lips Du,5: Du,5 = dvertical||p52, p58||,
Horizontal width between the two corners of the mouth Du,6: Du,6 = dhorizontal||p49, p55||,
and
dvertical||pi, pj|| = |pj,y − pi,y|, dhorizontal||pi, pj|| = |pj,x − pi,x|   (1),
In formula (1), pi is the coordinate of the i-th feature point, pj is the coordinate of the j-th feature point, pi,y is the ordinate of the i-th feature point, pj,y is the ordinate of the j-th feature point, pi,x is the abscissa of the i-th feature point, pj,x is the abscissa of the j-th feature point, dvertical||pi, pj|| is the vertical distance between feature points i and j, dhorizontal||pi, pj|| is the horizontal distance between feature points i and j, i = 1, 2, ..., 68, j = 1, 2, ..., 68;
Assuming that the first frame in the image frame sequence with normalized facial image size from the above second step is the neutral frame, the set V0 of its 6 specific distances is as shown in formula (2),
V0 = [D0,1, D0,2, D0,3, D0,4, D0,5, D0,6]   (2),
In formula (2), D0,1, D0,2, D0,3, D0,4, D0,5 and D0,6 are the 6 specific distances corresponding to the neutral frame in the image frame sequence with normalized facial image size from the above second step;
The set Vu of the 6 specific distances of the u-th frame in the image frame sequence with normalized facial image size from the above second step is as shown in formula (3),
Vu = [Du,1, Du,2, Du,3, Du,4, Du,5, Du,6]   (3),
In formula (3), u = 1, 2, ..., K−1, where K is the number of facial images in one image frame sequence with normalized facial image size from the above second step, and Du,1, Du,2, Du,3, Du,4, Du,5, Du,6 are the 6 specific distances corresponding to the u-th frame in the image frame sequence with normalized facial image size from the above second step;
The sum of the ratios of the 6 corresponding specific distances between the u-th frame and the neutral frame in the image frame sequence with normalized facial image size from the above second step is given by formula (4),
In formula (4), DFu represents the sum of the ratios between the 6 specific distances corresponding to the neutral frame image and those corresponding to the u-th frame image in the image frame sequence with normalized facial image size from the above second step, n indexes the 6 specific distances, D0,n represents the n-th specific distance corresponding to the neutral frame, and Du,n represents the n-th specific distance corresponding to the u-th frame;
In the image frame sequence with normalized facial image size from the above second step, the ratio DF corresponding to each frame image is obtained according to formula (2), formula (3) and formula (4), and the image frame with the maximum DF is screened out as the key frame of that image frame sequence,
Thus the marking of facial feature points on the image frame sequence and the screening of the key frame in the image frame sequence are completed;
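By way of illustration only, the key-frame screening of this third step can be sketched as follows, assuming 0-based indexing of the 68 landmarks and assuming that the per-distance ratio in formula (4) is taken as Du,n / D0,n (the direction of the ratio is left implicit above):

import numpy as np

# Landmark pairs for the six specific distances (points 22-40, 45-47, 37-49, 34-52, 52-58,
# 49-55 above, converted to 0-based indices); the last pair is measured horizontally.
PAIRS = [(21, 39), (44, 46), (36, 48), (33, 51), (51, 57), (48, 54)]
VERTICAL = [True, True, True, True, True, False]

def specific_distances(landmarks):
    # landmarks: (68, 2) array of (x, y) coordinates for one frame.
    dists = []
    for (i, j), vert in zip(PAIRS, VERTICAL):
        axis = 1 if vert else 0                       # y difference for vertical, x for horizontal
        dists.append(abs(landmarks[j, axis] - landmarks[i, axis]))
    return np.array(dists, dtype=float)

def select_key_frame(landmark_seq):
    # landmark_seq: list of (68, 2) arrays; frame 0 is taken as the neutral frame.
    v0 = specific_distances(landmark_seq[0])
    df = [(specific_distances(lm) / v0).sum() for lm in landmark_seq[1:]]
    return 1 + int(np.argmax(df))                     # index of the key frame in the sequence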
4th step, extraction of facial texture features:
The facial texture features are extracted with the LBP-TOP algorithm. First, the image frame sequence with normalized facial image size from the above second step is divided in space-time into the three orthogonal planes XY, XT and YT; the LBP value of the central pixel of each 3 × 3 neighborhood in each orthogonal plane is calculated, the LBP histogram features of the three orthogonal planes are counted, and finally the LBP histograms of the three orthogonal planes are concatenated to form the overall feature vector, where the LBP operator is calculated as shown in formula (5) and formula (6),
In formula (5) and formula (6), Z is the number of neighborhood points around the central pixel, R is the distance between a neighborhood point and the central pixel, tc is the pixel value of the central pixel, tq is the pixel value of the q-th neighborhood point, and Sig(tq − tc) is the LBP code of the q-th neighborhood point,
The LBP-TOP histogram is defined as shown in formula (7),
In formula (7), b is the plane index, where b = 0 is the XY plane, b = 1 is the XT plane and b = 2 is the YT plane; nb is the number of binary patterns generated by the LBP operator in the b-th plane, and I{LBPZ,R,b(x, y, t) = a} counts the pixels in the b-th plane whose LBP code equals a when features are extracted with the LBPZ,R operator;
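A simplified illustrative sketch of LBP-TOP in Python is given below; it computes an LBP histogram for every slice of each of the three orthogonal planes with scikit-image and concatenates the three normalised histograms. The use of scikit-image, Z = 8 neighbours, R = 1 and 256 histogram bins are assumptions of this sketch, which also omits any block partitioning of the face.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top_histogram(volume, P=8, R=1, n_bins=256):
    # volume: (T, H, W) grayscale face sequence.
    planes = {
        "XY": [volume[t, :, :] for t in range(volume.shape[0])],
        "XT": [volume[:, y, :] for y in range(volume.shape[1])],
        "YT": [volume[:, :, x] for x in range(volume.shape[2])],
    }
    feats = []
    for slices in planes.values():
        hist = np.zeros(n_bins)
        for img in slices:
            codes = local_binary_pattern(img, P, R, method="default")
            h, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins))
            hist += h
        feats.append(hist / hist.sum())               # normalise each plane's histogram
    return np.concatenate(feats)                      # 3 * n_bins dimensional texture feature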
5th step, extraction of facial geometric features:
Based on the key frame of the image frame sequence screened in the above third step, the coordinates of the T feature points marked in that key frame are calculated to obtain the geometric features of the facial expression. In the field of facial expression recognition, the most feature-rich facial region is the T-shaped region of the face, which mainly includes the eyebrow, eye, nose, chin and mouth regions; therefore the extraction of facial geometric features mainly extracts distance features between the marked points of the facial T-shaped region;
Step 5.1, calculating the Euclidean distance features of facial feature point pairs:
From the T feature points of the key frame in the screened image frame sequence obtained in the above third step, 14 pairs of eyebrow feature points, 6 pairs of feature points, 12 pairs of eye feature points, 12 pairs of mouth and nose feature points and 6 pairs of chin feature points are chosen, 50 pairs of feature points in total, and the Euclidean distance between the two points A and B of each pair is calculated, giving a 50-dimensional Euclidean distance feature denoted G50; the formula (8) for the Euclidean distance between feature points A and B is as follows,
In formula (8), pA is the coordinate of feature point A, pB is the coordinate of feature point B, pA,x is the abscissa of feature point A, pA,y is the ordinate of feature point A, pB,x is the abscissa of feature point B, and pB,y is the ordinate of feature point B;
Step 5.2, calculating the angle features of facial feature points:
From the T feature points of the key frame obtained by the screening in the above third step, 10 angles characterizing changes of the facial features are selected, covering the eyebrow, eye and mouth angles, and the angle features are extracted, a 10-dimensional angle feature in total, denoted Q10; the formula (9) for calculating a feature point angle is as follows,
In formula (9), pC, pD and pE are the coordinates of the three feature points forming an angle in the eyebrow, eye and mouth regions marked with facial feature points in the above third step, where pD is the coordinate of the vertex of the angle;
Step 5.3, calculating facial region area features:
Five regions of the facial image are selected, namely the left and right eyebrows, the two eyes and the mouth, and the area feature of each of these 5 regions is calculated separately. Because face sizes differ from person to person, the areas of the 5 facial regions extracted from the key frame are subtracted from the corresponding areas of the 5 facial regions extracted from the neutral frame, giving the variation features of the facial region areas, 5 dimensions in total, denoted O5. The eyebrow regions, the mouth region and the eye regions of the face are set as triangles, and each triangle area is calculated with Heron's formula. The Euclidean distance features of the facial feature point pairs G50, the angle features of the facial feature points Q10 and the facial region area features O5 are combined as the geometric feature F of the face, as shown in formula (10),
F = [G50 Q10 O5]   (10),
At this point, the facial texture features and the facial geometric features are concatenated, which completes the extraction of the facial expression features;
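The geometric feature of this 5th step can be sketched as follows; the helper names are illustrative, and the concrete lists of the 50 distance pairs, 10 angle triples and 5 region triangles (not reproduced here) must be supplied by the caller according to the choices above.

import numpy as np

def euclid(p, q):
    # Formula (8): Euclidean distance between two landmark coordinates.
    return float(np.hypot(q[0] - p[0], q[1] - p[1]))

def angle_at(p_c, p_d, p_e):
    # Angle in degrees at the vertex p_d formed with points p_c and p_e, as in formula (9).
    v1 = np.asarray(p_c, dtype=float) - np.asarray(p_d, dtype=float)
    v2 = np.asarray(p_e, dtype=float) - np.asarray(p_d, dtype=float)
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def heron_area(p1, p2, p3):
    # Triangle area from its three vertices by Heron's formula.
    a, b, c = euclid(p1, p2), euclid(p2, p3), euclid(p3, p1)
    s = (a + b + c) / 2.0
    return float(np.sqrt(max(s * (s - a) * (s - b) * (s - c), 0.0)))

def geometric_feature(key_pts, neutral_pts, dist_pairs, angle_triples, area_triples):
    # Concatenate G50 (distances), Q10 (angles) and O5 (area changes) into F = [G50 Q10 O5].
    g = [euclid(key_pts[i], key_pts[j]) for i, j in dist_pairs]
    q = [angle_at(key_pts[c], key_pts[d], key_pts[e]) for c, d, e in angle_triples]
    o = [heron_area(*[key_pts[k] for k in tri]) - heron_area(*[neutral_pts[k] for k in tri])
         for tri in area_triples]
    return np.array(g + q + o)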
6th step, grouping of facial expressions:
The six facial emotions, namely surprise, fear, anger, disgust, happiness and sadness, are divided pairwise into three groups, specifically as follows:
First group: surprise, fear; second group: anger, disgust; third group: happiness, sadness;
7th step, the first classification of facial expression recognition:
The facial expression features extracted in the above 4th and 5th steps are fed into an ELM classifier for training and testing, thereby completing the first classification of facial expression recognition and obtaining the recognition result of the first classification of facial expression recognition, where the ELM parameters are set as: ELM type: "classification", number of hidden-layer neurons: "20", activation function: "Sigmoid" function;
Process B. Speech emotion recognition as the second classification:
Process B, building on the facial expression recognition result of process A and combining it with speech features, performs speech emotion feature extraction and the second classification of speech emotion recognition separately for each of the three groups in the grouping of facial expressions of the above 6th step. The concrete operations are as follows:
8th step, extraction of speech emotion features:
Based on the classification results of the first classification of facial expression recognition in the above 7th step, and according to the grouping of the 6th step, different prosodic features are extracted for each group according to how sensitive its emotions are to the different audio prosodic features:
First group: zero-crossing rate ZCR and logarithmic energy LogE are extracted,
Second group: Teager energy operator TEO, zero-crossing rate ZCR and logarithmic energy LogE are extracted,
Third group: pitch Pitch, zero-crossing rate ZCR and Teager energy operator TEO are extracted,
Among the above prosodic features, pitch Pitch is calculated in the frequency domain,
For the speech signal M preprocessed in the above second step, pitch Pitch is calculated with the following formula (11),
In formula (11), Pitch is the pitch, DFT is the discrete Fourier transform, LM denotes the length of the speech signal, and the windowed signal term denotes the speech signal multiplied by a Hamming window and is calculated as shown in the following formula (12),
In formula (12), N is the number of Hamming windows and m is the m-th Hamming window;
The zero-crossing rate ZCR among the above prosodic features is calculated as shown in formula (13),
In formula (13), ZCR denotes the average zero-crossing rate of the N windows, | | is the absolute value sign, X(m) is the speech signal of the m-th window after framing and windowing, the function sgn{X(m)} judges the sign of the speech amplitude, and sgn{X(m)} is calculated by formula (14),
In formula (14), X(m) is the speech signal of the m-th window after framing and windowing;
The above logarithmic energy LogE is calculated by formula (15) as follows,
In formula (15), LogE denotes the total logarithmic energy of the N windows, X(m) is the speech signal of the m-th window after framing and windowing, and N is the number of windows;
The Teager energy operator TEO is defined as shown in formula (16),
In formula (16), ψ[X(m)] is the Teager energy operator TEO of the m-th window, X'(m) = dX(m)/dm, X''(m) = d²X(m)/dm²; for a signal of constant amplitude and frequency, X(m) = a·cos(φm + θ), where a is the signal amplitude, φ is the signal frequency, and θ is the initial phase angle of the signal,
For the audio file corresponding to the image frames of each of the three groups in the grouping of facial expressions of the above 6th step, the well-known Mel-frequency cepstral coefficients MFCC together with their first-order difference and second-order difference features are extracted; finally, the prosodic features of each group and the corresponding MFCC with its first-order and second-order difference features are concatenated to form the mixed audio features,
Thus the extraction of the speech emotion features is completed;
9th step, the second classification of speech emotion recognition:
The speech emotion features extracted in the above 8th step are fed into an SVM for training and testing, and the recognition rate of speech emotion recognition is finally obtained, where the SVM parameters are set as: penalty coefficient: "95", allowed redundancy output: "0", kernel parameter: "1", kernel function of the support vector machine: "Gaussian kernel",
Thus the second classification of speech emotion recognition is completed;
Process C. Fusion of facial expression recognition and speech emotion recognition:
Tenth step, fusion of facial expression recognition and speech emotion recognition at the decision level:
Since the speech emotion recognition of the above process B is a secondary recognition carried out on the basis of the facial emotion recognition of the above process A, the relationship between the two recognition rates is one of conditional probability, and the final recognition rate P(Audio_Visual) is calculated as shown in formula (17),
P(Audio_Visual) = P(Visual) × P(Audio | Visual)   (17),
In formula (17), P(Visual) is the recognition rate of the first-pass facial image recognition and P(Audio | Visual) is the recognition rate of the second-pass speech emotion recognition;
This completes the video emotion recognition that fuses facial expression recognition and speech emotion recognition through two progressive processes at the decision level.
2. The video emotion recognition method fusing facial expression recognition and speech emotion recognition according to claim 1, characterized in that: for the coordinates of the T feature points in the third step, T = 68.
CN201811272233.1A 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition Active CN109409296B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811272233.1A CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811272233.1A CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Publications (2)

Publication Number Publication Date
CN109409296A true CN109409296A (en) 2019-03-01
CN109409296B CN109409296B (en) 2020-12-01

Family

ID=65470610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811272233.1A Active CN109409296B (en) 2018-10-30 2018-10-30 Video emotion recognition method integrating facial expression recognition and voice emotion recognition

Country Status (1)

Country Link
CN (1) CN109409296B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731416A (en) * 2005-08-04 2006-02-08 上海交通大学 Method of quick and accurate human face feature point positioning
CN105139004A (en) * 2015-09-23 2015-12-09 河北工业大学 Face expression identification method based on video sequences
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN107704810A (en) * 2017-09-14 2018-02-16 南京理工大学 A kind of expression recognition method suitable for medical treatment and nursing
CN108682431A (en) * 2018-05-09 2018-10-19 武汉理工大学 A kind of speech-emotion recognition method in PAD three-dimensionals emotional space

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHIEN SHING OOI et al.: "A new approach of audio emotion recognition", Elsevier *
TARUN KRISHNA et al.: "Emotion recognition using facial and audio features", ICMI '13: Proceedings of the 15th ACM International Conference on Multimodal Interaction *
LU Guanming et al.: "Micro-expression recognition based on LBP-TOP features", Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition) *
HAN Zhiyan: "Research on Multi-modal Emotion Recognition Technology for Speech and Facial Expression Signals", 31 January 2017 *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961054A (en) * 2019-03-29 2019-07-02 山东大学 It is a kind of based on area-of-interest characteristic point movement anxiety, depression, angry facial expression recognition methods
CN110363074A (en) * 2019-06-03 2019-10-22 华南理工大学 One kind identifying exchange method for complicated abstract class of things peopleization
CN110414335A (en) * 2019-06-20 2019-11-05 北京奇艺世纪科技有限公司 Video frequency identifying method, device and computer readable storage medium
CN110443143A (en) * 2019-07-09 2019-11-12 武汉科技大学 The remote sensing images scene classification method of multiple-limb convolutional neural networks fusion
CN112308102A (en) * 2019-08-01 2021-02-02 北京易真学思教育科技有限公司 Image similarity calculation method, calculation device, and storage medium
CN110677598A (en) * 2019-09-18 2020-01-10 北京市商汤科技开发有限公司 Video generation method and device, electronic equipment and computer storage medium
CN110909613B (en) * 2019-10-28 2024-05-31 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN111144197A (en) * 2019-11-08 2020-05-12 宇龙计算机通信科技(深圳)有限公司 Human identification method, device, storage medium and electronic equipment
CN111178389A (en) * 2019-12-06 2020-05-19 杭州电子科技大学 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling
CN111178389B (en) * 2019-12-06 2022-02-11 杭州电子科技大学 Multi-mode depth layered fusion emotion analysis method based on multi-channel tensor pooling
CN111553311A (en) * 2020-05-13 2020-08-18 吉林工程技术师范学院 Micro-expression recognition robot and control method thereof
CN111798874A (en) * 2020-06-24 2020-10-20 西北师范大学 Voice emotion recognition method and system
CN112101462B (en) * 2020-09-16 2022-04-19 北京邮电大学 Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN
CN112101462A (en) * 2020-09-16 2020-12-18 北京邮电大学 Electromechanical device audio-visual information fusion method based on BMFCC-GBFB-DNN
CN112418095B (en) * 2020-11-24 2023-06-30 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112418095A (en) * 2020-11-24 2021-02-26 华中师范大学 Facial expression recognition method and system combined with attention mechanism
CN112488219A (en) * 2020-12-07 2021-03-12 江苏科技大学 Mood consolation method and system based on GRU and mobile terminal
CN112766112B (en) * 2021-01-08 2023-01-17 山东大学 Dynamic expression recognition method and system based on space-time multi-feature fusion
CN112766112A (en) * 2021-01-08 2021-05-07 山东大学 Dynamic expression recognition method and system based on space-time multi-feature fusion
CN112949560B (en) * 2021-03-24 2022-05-24 四川大学华西医院 Method for identifying continuous expression change of long video expression interval under two-channel feature fusion
CN112949560A (en) * 2021-03-24 2021-06-11 四川大学华西医院 Method for identifying continuous expression change of long video expression interval under two-channel feature fusion
CN113076847A (en) * 2021-03-29 2021-07-06 济南大学 Multi-mode emotion recognition method and system
CN113065449A (en) * 2021-03-29 2021-07-02 济南大学 Face image acquisition method and device, computer equipment and storage medium
CN113111789A (en) * 2021-04-15 2021-07-13 山东大学 Facial expression recognition method and system based on video stream
CN113128399A (en) * 2021-04-19 2021-07-16 重庆大学 Speech image key frame extraction method for emotion recognition
CN117577140A (en) * 2024-01-16 2024-02-20 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children
CN117577140B (en) * 2024-01-16 2024-03-19 北京岷德生物科技有限公司 Speech and facial expression data processing method and system for cerebral palsy children

Also Published As

Publication number Publication date
CN109409296B (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN109409296A (en) The video feeling recognition methods that facial expression recognition and speech emotion recognition are merged
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN108805088B (en) Physiological signal analysis subsystem based on multi-modal emotion recognition system
CN108717856B (en) Speech emotion recognition method based on multi-scale deep convolution cyclic neural network
Guanghui et al. Multi-modal emotion recognition by fusing correlation features of speech-visual
CN109614895A (en) A method of the multi-modal emotion recognition based on attention Fusion Features
Yang et al. Feature augmenting networks for improving depression severity estimation from speech signals
CN113158727A (en) Bimodal fusion emotion recognition method based on video and voice information
Ocquaye et al. Dual exclusive attentive transfer for unsupervised deep convolutional domain adaptation in speech emotion recognition
Alshamsi et al. Automated facial expression and speech emotion recognition app development on smart phones using cloud computing
CN113643723A (en) Voice emotion recognition method based on attention CNN Bi-GRU fusion visual information
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
Saeed et al. Automated facial expression recognition framework using deep learning
Liang Intelligent emotion evaluation method of classroom teaching based on expression recognition
Jaratrotkamjorn et al. Bimodal emotion recognition using deep belief network
CN107437090A (en) The continuous emotion Forecasting Methodology of three mode based on voice, expression and electrocardiosignal
Huijuan et al. Coarse-to-fine speech emotion recognition based on multi-task learning
Atkar et al. Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier
Turker et al. Audio-facial laughter detection in naturalistic dyadic conversations
Li et al. A novel art gesture recognition model based on two channel region-based convolution neural network for explainable human-computer interaction understanding
Datcu et al. Multimodal recognition of emotions in car environments
Chelali Bimodal fusion of visual and speech data for audiovisual speaker recognition in noisy environment
Fu et al. An adversarial training based speech emotion classifier with isolated gaussian regularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant