CN110516696A - Adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression - Google Patents
Adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression
- Publication number: CN110516696A
- Application number: CN201910632006.3A
- Authority
- CN
- China
- Prior art keywords
- expression
- data
- voice
- feature
- human face
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention relates to an adaptive-weight bimodal emotion recognition method that fuses speech and facial expression, comprising the following steps: obtain emotional speech and facial-expression data, pair the affective data with emotion categories, and select a training sample set and a test sample set; extract speech emotion features from the speech data and dynamic expression features from the expression data; based on the speech emotion features and the expression features separately, learn with a deep-learning method based on a semi-supervised autoencoder, obtaining classification results and per-class output probabilities from a softmax classifier; finally, fuse the two single-modality emotion recognition results at the decision level using an adaptive weighting method to obtain the final emotion recognition result. Because the invention accounts for individual differences in how well each modality's affective features characterize emotion, the adaptive weight fusion method achieves higher accuracy and objectivity.
Description
Technical field
The present invention relates to the field of emotion recognition within affective computing, and in particular to an adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression.
Background art
In recent years, with the development of artificial intelligence and robotics, traditional modes of interaction no longer satisfy users' needs; new forms of human-computer interaction require an exchange of emotion. Emotion recognition has therefore become key to the development of human-computer interaction technology and a hot research topic. Emotion recognition is a multidisciplinary subject: by making computers understand and recognize human emotion, it becomes possible to predict and understand human behavioral tendencies and psychological states, and thereby realize efficient and harmonious affective human-computer interaction.
Human emotion is expressed in many ways, such as speech, facial expression, posture, and text, from which effective information can be extracted to analyze mood correctly. Facial expression and speech are the most salient and the most easily analyzed of these cues, and they have been widely studied and applied. The psychologist Mehrabian proposed the formula: emotional expression = 7% spoken words + 38% vocal tone + 55% facial expression. A person's voice and facial expression thus convey 93% of the emotional information and form the core of human communication. During emotional expression, facial deformation conveys inner emotion effectively and intuitively, making it one of the most important characteristic signals for emotion recognition, and speech features likewise convey rich emotion.
With the development of the internet and the continual emergence of social media in recent years, the ways people communicate have been greatly enriched (video, audio, and so on), making multimodal emotion recognition possible. Traditional single-modality recognition suffers from the problem that a single affective feature may not characterize the emotional state well. For example, when a person expresses sadness, the facial expression may change little, yet the sadness can still be heard in a dull, low, slow voice. Multimodal recognition lets the information of different modalities complement one another, supplies more emotional information, and improves recognition accuracy. At present, however, single-modality emotion recognition research is relatively mature while multimodal methods remain to be developed and refined, so multimodal emotion recognition has important practical significance. Since expression and speech are the most dominant features, bimodal emotion recognition based on the two has important research significance and practical value. Traditional weighting methods ignore individual differences; a method of adaptive weighting is therefore needed to assign the weights.
Summary of the invention
The object of the present invention is to provide an adaptive-weight bimodal emotion recognition method based on speech and facial expression, so as to realize the complementarity of the two modalities and an adaptive weight assignment that accounts for individual differences.
To this end, the invention adopts the following technical scheme:
A recognition method based on adaptive-weight bimodal fusion of speech and facial expression, characterized in that the method comprises the following steps:
S1. Obtain emotional speech and facial-expression data, pair the affective data with emotion categories, and select a training sample set and a test sample set.
S2. Extract speech emotion features from the speech data and dynamic expression features from the expression data: first automatically extract the expression peak frame to obtain the dynamic image sequence from expression onset to expression peak, then normalize the variable-length image sequence to a fixed-length image sequence, which serves as the dynamic expression feature.
S3. Based on the speech emotion features and the expression features separately, learn with a deep-learning method based on a semi-supervised autoencoder, and obtain classification results and per-class output probabilities from a softmax classifier.
S4. Fuse the two single-modality emotion recognition results at the decision level using an adaptive weight assignment method to obtain the final emotion recognition result.
Further, the specific steps of step S2 described above are as follows:
S2A.1: For the speech emotion data, divide each acquired speech sample into frames, producing multiple speech segments, and apply a window function to the framed segments to obtain the speech emotion signal.
S2A.2: From the speech emotion signal obtained in S2A.1, extract low-level descriptors at the frame level: fundamental frequency F0, short-time energy, shimmer, harmonics-to-noise ratio, Mel-frequency cepstral coefficients, and so on.
S2A.3: Over the low-level descriptors obtained at the frame level in S2A.2, compute statistics at the level of the multi-frame speech sample by applying multiple statistical functionals (maximum, minimum, mean, standard deviation, etc.) to obtain the speech emotion features.
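The frame-then-functional pipeline of steps S2A.1 to S2A.3 can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the frame length, hop size, and the single short-time-energy descriptor are placeholder choices, and a real system would add dedicated estimators for F0, shimmer, HNR, and MFCCs.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Step S2A.1: split the signal into overlapping frames and window them."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def functionals(lld):
    """Step S2A.3: statistical functionals over the frame-level descriptors."""
    return np.array([lld.max(), lld.min(), lld.mean(), lld.std()])

# Short-time energy as one example low-level descriptor (step S2A.2).
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)      # 1 s of audio at 16 kHz
frames = frame_signal(signal)            # 25 ms frames with a 10 ms hop
energy = (frames ** 2).sum(axis=1)       # frame-level short-time energy
feature_vector = functionals(energy)     # utterance-level speech feature
```

Each additional descriptor contributes its own block of functionals; concatenating them yields the utterance-level speech emotion feature vector.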
S2B.1: For the facial-expression data, first transform the acquired three-dimensional coordinates of the facial feature points: take the nose tip as the centre point, obtain a rotation matrix via SVD, and multiply by the rotation matrix to rotate the points, thereby eliminating the influence of head-pose variation.
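The SVD-based rotation normalization of step S2B.1 can be sketched with the Kabsch algorithm, which computes the optimal rigid rotation between two point sets exactly via SVD. This is a sketch under two assumptions not stated in the patent: that the nose-tip landmark is stored at row 0 of both arrays, and that a neutral reference shape is available to align against.

```python
import numpy as np

def remove_head_pose(points, reference):
    """Rotate nose-centred 3-D landmarks onto a reference shape using the
    SVD-based Kabsch algorithm (a sketch of step S2B.1)."""
    P = points - points[0]          # centre on the nose tip (row 0, assumed)
    Q = reference - reference[0]
    U, _, Vt = np.linalg.svd(P.T @ Q)
    R = Vt.T @ U.T                  # optimal rotation mapping P onto Q
    if np.linalg.det(R) < 0:        # guard against an improper reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return P @ R.T                  # pose-normalised landmarks
```

Because both point sets are centred before the SVD, the same step also removes head translation.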
S2B.2: Extract the peak expression frame using slow feature analysis (SFA); the specific steps are as follows:
1) Treat each dynamic image-sequence sample as a time-varying input signal x(t) = [x1(t), x2(t), …, xI(t)]^T;
2) Normalize x(t) so that each component has zero mean and unit variance;
3) Apply a nonlinear expansion to the input signal, converting the problem into a linear SFA problem;
4) Whiten the data;
5) Solve with the linear SFA method.
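Steps 2) to 5) can be sketched in a few lines of NumPy: after normalization and whitening, the slowest features are the directions along which the temporal derivative has the smallest variance. The demo signal (a slow sine mixed with a fast one) is an illustrative assumption, standing in for the image sequences; the nonlinear expansion of step 3) is omitted, so this is the purely linear case.

```python
import numpy as np

def linear_sfa(X, n_components=1):
    """Minimal linear SFA: normalize, whiten, then keep the directions
    whose temporal derivative has the smallest variance (the slowest)."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)        # step 2): zero mean, unit variance
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    Z = X @ (eigvec / np.sqrt(eigval))              # step 4): whitening
    dZ = np.diff(Z, axis=0)                         # temporal derivative
    _, dvec = np.linalg.eigh(np.cov(dZ, rowvar=False))  # ascending eigenvalues
    return Z @ dvec[:, :n_components]               # step 5): slowest signals

# Demo: recover a slow sine that is linearly mixed with a fast one.
t = np.linspace(0, 2 * np.pi, 500)
slow, fast = np.sin(t), np.sin(20 * t)
X = np.column_stack([slow + 0.5 * fast, 0.5 * slow - fast])
slowest = linear_sfa(X)[:, 0]
```

For the expression sequences, the frame at which the slowest output reaches its extremum would presumably mark the expression apex; the patent does not spell out this final selection rule.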
S2B.3: After obtaining the dynamic expression sequence from the expression onset frame to the expression peak frame, normalize the variable-length dynamic features using linear interpolation.
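The length normalization of step S2B.3 can be sketched as per-dimension linear interpolation onto a common time axis. The target length of 16 frames is an assumption for illustration; the patent does not state the fixed length it uses.

```python
import numpy as np

def normalize_length(seq, target_len=16):
    """Resample a variable-length (frames x features) sequence to a fixed
    number of frames by linear interpolation, one dimension at a time."""
    seq = np.asarray(seq, dtype=float)
    src = np.linspace(0.0, 1.0, len(seq))       # original time axis
    dst = np.linspace(0.0, 1.0, target_len)     # fixed-length time axis
    return np.column_stack([np.interp(dst, src, seq[:, d])
                            for d in range(seq.shape[1])])

# A 10-frame, 1-dimensional toy sequence resampled to 4 frames.
out = normalize_length(np.arange(10.0).reshape(-1, 1), target_len=4)
```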
Further, the specific steps of step S3 described above are as follows:
S3.1: For data of a given modality, input both unlabeled and labeled training samples; via autoencoder encoding and decoding and the softmax classifier output, generate the reconstructed data and the classification output respectively.
S3.2: Compute the unsupervised reconstruction error and the supervised classification error.
S3.3: Construct the optimization objective, which considers the reconstruction error and the classification error simultaneously:
E(θ) = αEr + (1 − α)Ec
S3.4: Update the parameters by gradient descent until the objective converges.
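One evaluation of the combined objective E(θ) = αEr + (1 − α)Ec can be sketched as below. This is a toy single-layer autoencoder with randomly initialized weights, an assumption for illustration only: the patent does not specify its network architecture, and a real implementation would wrap this loss in the gradient-descent loop of step S3.4.

```python
import numpy as np

def semi_supervised_loss(x, y_onehot, W_enc, W_dec, W_cls, alpha=0.5):
    """Combined objective E = alpha*Er + (1 - alpha)*Ec. Unlabeled batches
    (y_onehot=None) contribute only the reconstruction term Er; labeled
    batches also add the softmax cross-entropy Ec."""
    h = np.tanh(x @ W_enc)                              # encoder
    x_rec = h @ W_dec                                   # decoder
    Er = np.mean((x - x_rec) ** 2)                      # reconstruction error
    if y_onehot is None:
        return alpha * Er
    logits = h @ W_cls                                  # softmax classifier
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    Ec = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    return alpha * Er + (1 - alpha) * Ec

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 6))                         # 8 samples, 6-dim features
W_enc, W_dec, W_cls = (0.1 * rng.standard_normal(s)
                       for s in [(6, 4), (4, 6), (4, 3)])
y = np.eye(3)[rng.integers(0, 3, 8)]                    # one-hot labels, 3 classes
labeled_loss = semi_supervised_loss(x, y, W_enc, W_dec, W_cls)
unlabeled_loss = semi_supervised_loss(x, None, W_enc, W_dec, W_cls)
```

The weighting α trades off how much the representation is shaped by reconstruction versus classification.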
Further, the specific steps of step S4 described above are as follows:
S4.1: Obtain the per-class output probabilities of the softmax classifier for the test sample under each of the two modalities, and compute the variable δk, which measures how well modality k characterizes the emotion; the adaptive weight assignment is realized according to the value of δk for each sample. Here J is the number of classes, P = {pj | j = 1, …, J} is the vector of per-class output probabilities of the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2: Map δk into [0, 1] according to the mapping formula and use it as the weight, where a and b are self-selected parameters determined by the specific situation.
S4.3: Obtain the fused output probability vector Pfinal = {pfinal_j | j = 1, …, J}; the class with the maximum probability is the recognized class. Here pj_k is the output probability of class j obtained from single-modality emotion recognition with modality k, and there are K modalities in total.
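Steps S4.1 to S4.3 can be sketched as follows. The extraction does not reproduce the patent's formula for δk, so this sketch assumes δk is the Euclidean distance d between the modality's probability vector and the uniform distribution (a peaked, confident output gives a larger δk and hence a larger weight); the sigmoid with parameters a and b stands in for the unspecified [0, 1] mapping. All of these are illustrative assumptions.

```python
import numpy as np

def adaptive_fusion(prob_vectors, a=10.0, b=0.0):
    """Decision-level fusion of K per-modality softmax outputs for one
    sample, with adaptive weights (a sketch of steps S4.1-S4.3)."""
    P = [np.asarray(p, dtype=float) for p in prob_vectors]
    J = len(P[0])
    uniform = np.full(J, 1.0 / J)
    # S4.1 (assumed form): delta_k = d(P_k, uniform distribution)
    delta = np.array([np.linalg.norm(p - uniform) for p in P])
    # S4.2: map delta_k into (0, 1) with self-selected parameters a, b
    w = 1.0 / (1.0 + np.exp(-(a * delta + b)))
    w /= w.sum()                                  # normalise the K weights
    # S4.3: weighted sum of the per-modality probability vectors
    p_final = sum(wk * pk for wk, pk in zip(w, P))
    return p_final, int(np.argmax(p_final))

speech = [0.70, 0.10, 0.10, 0.10]   # confident modality: larger weight
face = [0.30, 0.30, 0.20, 0.20]     # near-uniform modality: smaller weight
p_final, label = adaptive_fusion([speech, face])
```

With these inputs the speech modality dominates the fused vector, which is the behaviour the adaptive scheme is designed to produce.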
Compared with the prior art, the beneficial effects of the present invention are as follows: the adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression achieves accurate and efficient recognition on a standard database. By accounting for individual differences in how well each modality's affective features characterize emotion, the adaptive weight fusion method attains higher accuracy and objectivity: on the IEMOCAP emotion corpus it achieves a recognition rate of 83%, an improvement of about 3% over traditional fixed-weight assignment.
Brief description of the drawings
Fig. 1 is a schematic overall flow diagram of the recognition method of the present invention.
Fig. 2 is a flow diagram of step S3 of the present invention.
Fig. 3 is a flow diagram of the adaptive weight assignment of the present invention.
Specific embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings; the given examples serve only to explain the present invention and are not intended to limit its scope.
Embodiment 1: Referring to Figs. 1-3, a recognition method based on adaptive-weight bimodal fusion of speech and facial expression comprises the following steps:
S1. Obtain emotional speech and facial-expression data, pair the affective data with emotion categories, and select a training sample set and a test sample set.
S2. Extract speech emotion features from the speech data and dynamic expression features from the expression data: first automatically extract the expression peak frame to obtain the dynamic image sequence from expression onset to expression peak, then normalize the variable-length image sequence to a fixed-length image sequence, which serves as the dynamic expression feature.
S3. Based on the speech emotion features and the expression features separately, learn with a deep-learning method based on a semi-supervised autoencoder, and obtain classification results and per-class output probabilities from a softmax classifier.
S4. Fuse the two single-modality emotion recognition results at the decision level using an adaptive weight assignment method to obtain the final emotion recognition result.
Further, the specific steps of step S2 described above are as follows:
S2A.1: For the speech emotion data, divide each acquired speech sample into frames, producing multiple speech segments, and apply a window function to the framed segments to obtain the speech emotion signal.
S2A.2: From the speech emotion signal obtained in S2A.1, extract low-level descriptors at the frame level: fundamental frequency F0, short-time energy, shimmer, harmonics-to-noise ratio, Mel-frequency cepstral coefficients, and so on.
S2A.3: Over the low-level descriptors obtained at the frame level in S2A.2, compute statistics at the level of the multi-frame speech sample by applying multiple statistical functionals (maximum, minimum, mean, standard deviation, etc.) to obtain the speech emotion features.
S2B.1: For the facial-expression data, first transform the acquired three-dimensional coordinates of the facial feature points: take the nose tip as the centre point, obtain a rotation matrix via SVD, and multiply by the rotation matrix to rotate the points, thereby eliminating the influence of head-pose variation.
S2B.2: Extract the peak expression frame using slow feature analysis (SFA); the specific steps are as follows:
1) Treat each dynamic image-sequence sample as a time-varying input signal x(t) = [x1(t), x2(t), …, xI(t)]^T;
2) Normalize x(t) so that each component has zero mean and unit variance;
3) Apply a nonlinear expansion to the input signal, converting the problem into a linear SFA problem;
4) Whiten the data;
5) Solve with the linear SFA method.
S2B.3: After obtaining the dynamic expression sequence from the expression onset frame to the expression peak frame, normalize the variable-length dynamic features using linear interpolation.
Further, the specific steps of step S3 described above are as follows:
S3.1: For data of a given modality, input both unlabeled and labeled training samples; via autoencoder encoding and decoding and the softmax classifier output, generate the reconstructed data and the classification output respectively.
S3.2: Compute the unsupervised reconstruction error and the supervised classification error.
S3.3: Construct the optimization objective, which considers the reconstruction error and the classification error simultaneously:
E(θ) = αEr + (1 − α)Ec
S3.4: Update the parameters by gradient descent until the objective converges.
Further, the specific steps of step S4 described above are as follows:
S4.1: Obtain the per-class output probabilities of the softmax classifier for the test sample under each of the two modalities, and compute the variable δk, which measures how well modality k characterizes the emotion; the adaptive weight assignment is realized according to the value of δk for each sample. Here J is the number of classes, P = {pj | j = 1, …, J} is the vector of per-class output probabilities of the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2: Map δk into [0, 1] according to the mapping formula and use it as the weight, where a and b are self-selected parameters.
S4.3: Obtain the fused output probability vector Pfinal = {pfinal_j | j = 1, …, J}; the class with the maximum probability is the recognized class. Here pj_k is the output probability of class j obtained from single-modality emotion recognition with modality k, and there are K modalities in total.
Application example: Referring to Figs. 1-3, this example uses the IEMOCAP affective database as material; the simulation platform is MATLAB R2014a.
As shown in Fig. 1, the adaptive-weight bimodal emotion recognition method based on speech and expression of the present invention mainly comprises the following steps:
S1. Obtain emotional speech and facial-expression data, pair the affective data with emotion categories, and select a training sample set and a test sample set. Four emotion categories are chosen: neutral, happy, sad, and angry.
S2. Extract speech emotion features from the speech data and dynamic expression features from the expression data: first automatically extract the expression peak frame to obtain the dynamic image sequence from expression onset to expression peak, then normalize the variable-length image sequence to a fixed-length image sequence, which serves as the dynamic expression feature. The speech features are extracted with the open-source feature-extraction toolkit openSMILE, using the INTERSPEECH 2010 Paralinguistic Challenge standard feature set, 1582 features in total. For the dynamic expression features, the peak expression frame is extracted with slow feature analysis, and the expression onset frame is then found by thresholding; after the dynamic expression sequence from onset frame to peak frame is obtained, the variable-length dynamic features are normalized with linear interpolation.
S3. Based on the speech emotion features and the expression features separately, learn with a deep-learning method based on a semi-supervised autoencoder, and obtain classification results and per-class output probabilities from a softmax classifier.
S4. Fuse the two single-modality emotion recognition results at the decision level using an adaptive weight assignment method to obtain the final emotion recognition result.
As shown in Fig. 2, the specific steps of the semi-supervised classification in step S3 are:
S3.1: For data of a given modality, input both unlabeled and labeled training samples; via autoencoder encoding and decoding and the softmax classifier output, generate the reconstructed data and the classification output respectively.
S3.2: Compute the unsupervised reconstruction error and the supervised classification error.
S3.3: Construct the optimization objective, which considers the reconstruction error and the classification error simultaneously:
E(θ) = αEr + (1 − α)Ec
S3.4: Update the parameters by gradient descent until the objective converges.
As shown in Fig. 3, the specific steps of step S4 are:
S4.1: Obtain the per-class output probabilities of the softmax classifier for the test sample under each of the two modalities, and compute the variable δk, which measures how well modality k characterizes the emotion; the adaptive weight assignment is realized according to the value of δk for each sample. Here J is the number of classes, P = {pj | j = 1, …, J} is the vector of per-class output probabilities of the softmax classifier, and d denotes the Euclidean distance between two vectors.
S4.2: Map δk into [0, 1] according to the mapping formula and use it as the weight, where a and b are self-selected parameters.
S4.3: Obtain the fused output probability vector Pfinal = {pfinal_j | j = 1, …, J}; the class with the maximum probability is the recognized class. Here pj_k is the output probability of class j obtained from single-modality emotion recognition with modality k, and there are K modalities in total.
It should be noted that the above embodiment is only a preferred embodiment of the present invention and is not intended to limit the scope of protection; equivalent replacements or substitutions made on the basis of the above technical scheme all fall within the scope of protection of the present invention.
Claims (5)
1. An adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression, characterized in that the method comprises the following steps:
S1. obtaining emotional speech data and facial-expression data, pairing the affective data with emotion categories, and selecting a training sample set and a test sample set;
S2. extracting speech emotion features from the speech data and dynamic expression features from the expression data: first automatically extracting the expression peak frame to obtain the dynamic image sequence from expression onset to expression peak, then normalizing the variable-length image sequence to a fixed-length image sequence as the dynamic expression feature;
S3. based on the speech emotion features and the expression features separately, learning with a deep-learning method based on a semi-supervised autoencoder, and obtaining classification results and per-class output probabilities from a softmax classifier;
S4. fusing the two single-modality emotion recognition results at the decision level using an adaptive weight assignment method to obtain the final emotion recognition result.
2. The bimodal fusion emotion recognition method based on speech and facial expression according to claim 1, characterized in that the specific steps of the affective feature extraction in step S2 are:
S2A.1: for the speech emotion data, dividing each acquired speech sample into frames, producing multiple speech segments, and applying a window function to the framed segments to obtain the speech emotion signal;
S2A.2: from the speech emotion signal obtained in S2A.1, extracting low-level descriptors at the frame level: fundamental frequency F0, short-time energy, shimmer, harmonics-to-noise ratio, Mel-frequency cepstral coefficients, and so on;
S2A.3: over the low-level descriptors obtained at the frame level, computing statistics at the level of the multi-frame speech sample by applying multiple statistical functionals (maximum, minimum, mean, standard deviation, etc.) to obtain the speech emotion features;
S2B.1: for the facial-expression data, first transforming the acquired three-dimensional coordinates of the facial feature points: taking the nose tip as the centre point, obtaining a rotation matrix via SVD, and multiplying by the rotation matrix to rotate the points, thereby eliminating the influence of head-pose variation;
S2B.2: extracting the peak expression frame using slow feature analysis;
S2B.3: after obtaining the dynamic expression sequence from the expression onset frame to the expression peak frame, normalizing the variable-length dynamic features using linear interpolation.
3. The bimodal fusion emotion recognition method based on speech and facial expression according to claim 1, characterized in that the specific steps of the semi-supervised learning in step S3 are:
S3.1: for data of a given modality, inputting both unlabeled and labeled training samples, and generating the reconstructed data and the classification output respectively via autoencoder encoding and decoding and the softmax classifier output;
S3.2: computing the unsupervised reconstruction error and the supervised classification error;
S3.3: constructing the optimization objective, which considers the reconstruction error and the classification error simultaneously:
E(θ) = αEr + (1 − α)Ec;
S3.4: updating the parameters by gradient descent until the objective converges.
4. The adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression according to claim 1, characterized in that the decision-level fusion based on adaptive weighting in step S4 comprises:
S4.1: obtaining the per-class output probabilities of the softmax classifier for the test sample under each of the two modalities and computing the variable δk, which measures how well modality k characterizes the emotion, the adaptive weight assignment being realized according to the value of δk for each sample, wherein J is the number of classes, P = {pj | j = 1, …, J} is the vector of per-class output probabilities of the softmax classifier, and d denotes the Euclidean distance between two vectors;
S4.2: mapping δk into [0, 1] according to the mapping formula and using it as the weight, wherein a and b are self-selected parameters;
S4.3: obtaining the fused output probability vector Pfinal = {pfinal_j | j = 1, …, J}, the class with the maximum probability being the recognized class, wherein pj_k is the output probability of class j obtained from single-modality emotion recognition with modality k, and there are K modalities in total.
5. The adaptive-weight bimodal fusion emotion recognition method based on speech and facial expression according to claim 2, characterized in that the specific steps of extracting the peak expression frame with slow feature analysis in S2B.2 are:
1) treating each dynamic image-sequence sample as a time-varying input signal x(t) = [x1(t), x2(t), …, xI(t)]^T;
2) normalizing x(t) so that each component has zero mean and unit variance;
3) applying a nonlinear expansion to the input signal, converting the problem into a linear SFA problem;
4) whitening the data;
5) solving with the linear SFA method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910632006.3A CN110516696B (en) | 2019-07-12 | 2019-07-12 | Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910632006.3A CN110516696B (en) | 2019-07-12 | 2019-07-12 | Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110516696A true CN110516696A (en) | 2019-11-29 |
CN110516696B CN110516696B (en) | 2023-07-25 |
Family
ID=68623425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910632006.3A Active CN110516696B (en) | 2019-07-12 | 2019-07-12 | Self-adaptive weight bimodal fusion emotion recognition method based on voice and expression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516696B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027215A (en) * | 2019-12-11 | 2020-04-17 | 中国人民解放军陆军工程大学 | Character training system and method for virtual human |
CN111401268A (en) * | 2020-03-19 | 2020-07-10 | 内蒙古工业大学 | Multi-mode emotion recognition method and device for open environment |
CN111460494A (en) * | 2020-03-24 | 2020-07-28 | 广州大学 | Multi-mode deep learning-oriented privacy protection method and system |
CN112006697A (en) * | 2020-06-02 | 2020-12-01 | 东南大学 | Gradient boosting decision tree depression recognition method based on voice signals |
CN112101096A (en) * | 2020-08-02 | 2020-12-18 | 华南理工大学 | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression |
CN112401886A (en) * | 2020-10-22 | 2021-02-26 | 北京大学 | Processing method, device and equipment for emotion recognition and storage medium |
CN112418034A (en) * | 2020-11-12 | 2021-02-26 | 元梦人文智能国际有限公司 | Multi-modal emotion recognition method and device, electronic equipment and storage medium |
CN112528835A (en) * | 2020-12-08 | 2021-03-19 | 北京百度网讯科技有限公司 | Training method, recognition method and device of expression prediction model and electronic equipment |
CN113033450A (en) * | 2021-04-02 | 2021-06-25 | 山东大学 | Multi-mode continuous emotion recognition method, service inference method and system |
CN113076847A (en) * | 2021-03-29 | 2021-07-06 | 济南大学 | Multi-mode emotion recognition method and system |
CN113343860A (en) * | 2021-06-10 | 2021-09-03 | 南京工业大学 | Bimodal fusion emotion recognition method based on video image and voice |
CN113780198A (en) * | 2021-09-15 | 2021-12-10 | 南京邮电大学 | Multi-mode emotion classification method for image generation |
JP2022526148A (en) * | 2019-09-18 | 2022-05-23 | ベイジン センスタイム テクノロジー デベロップメント カンパニー, リミテッド | Video generation methods, devices, electronic devices and computer storage media |
CN114626430A (en) * | 2021-12-30 | 2022-06-14 | 华院计算技术(上海)股份有限公司 | Emotion recognition model training method, emotion recognition device and emotion recognition medium |
CN114912502A (en) * | 2021-12-28 | 2022-08-16 | 天翼数字生活科技有限公司 | Bimodal deep semi-supervised emotion classification method based on expressions and voices |
CN115240649A (en) * | 2022-07-19 | 2022-10-25 | 于振华 | Voice recognition method and system based on deep learning |
CN116561533A (en) * | 2023-07-05 | 2023-08-08 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
2019-07-12: CN application CN201910632006.3A filed; granted as patent CN110516696B (legal status: Active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105976809A (en) * | 2016-05-25 | 2016-09-28 | 中国地质大学(武汉) | Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022526148A (en) * | 2019-09-18 | 2022-05-23 | Beijing SenseTime Technology Development Co., Ltd. | Video generation methods, devices, electronic devices and computer storage media |
CN111027215B (en) * | 2019-12-11 | 2024-02-20 | 中国人民解放军陆军工程大学 | Character training system and method for virtual person |
CN111027215A (en) * | 2019-12-11 | 2020-04-17 | 中国人民解放军陆军工程大学 | Character training system and method for virtual human |
CN111401268A (en) * | 2020-03-19 | 2020-07-10 | 内蒙古工业大学 | Multi-mode emotion recognition method and device for open environment |
CN111460494A (en) * | 2020-03-24 | 2020-07-28 | 广州大学 | Multi-mode deep learning-oriented privacy protection method and system |
CN111460494B (en) * | 2020-03-24 | 2023-04-07 | 广州大学 | Multi-mode deep learning-oriented privacy protection method and system |
CN112006697A (en) * | 2020-06-02 | 2020-12-01 | 东南大学 | Gradient boosting decision tree depression recognition method based on voice signals |
CN112101096A (en) * | 2020-08-02 | 2020-12-18 | 华南理工大学 | Suicide emotion perception method based on multi-mode fusion of voice and micro-expression |
CN112101096B (en) * | 2020-08-02 | 2023-09-22 | 华南理工大学 | Multi-mode fusion suicide emotion perception method based on voice and micro-expression |
CN112401886B (en) * | 2020-10-22 | 2023-01-31 | 北京大学 | Processing method, device and equipment for emotion recognition and storage medium |
CN112401886A (en) * | 2020-10-22 | 2021-02-26 | 北京大学 | Processing method, device and equipment for emotion recognition and storage medium |
CN112418034A (en) * | 2020-11-12 | 2021-02-26 | 元梦人文智能国际有限公司 | Multi-modal emotion recognition method and device, electronic equipment and storage medium |
CN112528835A (en) * | 2020-12-08 | 2021-03-19 | 北京百度网讯科技有限公司 | Training method, recognition method and device of expression prediction model and electronic equipment |
CN112528835B (en) * | 2020-12-08 | 2023-07-04 | 北京百度网讯科技有限公司 | Training method and device of expression prediction model, recognition method and device and electronic equipment |
CN113076847B (en) * | 2021-03-29 | 2022-06-17 | 济南大学 | Multi-mode emotion recognition method and system |
CN113076847A (en) * | 2021-03-29 | 2021-07-06 | 济南大学 | Multi-mode emotion recognition method and system |
CN113033450A (en) * | 2021-04-02 | 2021-06-25 | 山东大学 | Multi-mode continuous emotion recognition method, service inference method and system |
CN113343860A (en) * | 2021-06-10 | 2021-09-03 | 南京工业大学 | Bimodal fusion emotion recognition method based on video image and voice |
CN113780198B (en) * | 2021-09-15 | 2023-11-24 | 南京邮电大学 | Multi-mode emotion classification method for image generation |
CN113780198A (en) * | 2021-09-15 | 2021-12-10 | 南京邮电大学 | Multi-mode emotion classification method for image generation |
CN114912502A (en) * | 2021-12-28 | 2022-08-16 | 天翼数字生活科技有限公司 | Bimodal deep semi-supervised emotion classification method based on expressions and voices |
CN114912502B (en) * | 2021-12-28 | 2024-03-29 | 天翼数字生活科技有限公司 | Double-mode deep semi-supervised emotion classification method based on expressions and voices |
CN114626430B (en) * | 2021-12-30 | 2022-10-18 | 华院计算技术(上海)股份有限公司 | Emotion recognition model training method, emotion recognition device and emotion recognition medium |
CN114626430A (en) * | 2021-12-30 | 2022-06-14 | 华院计算技术(上海)股份有限公司 | Emotion recognition model training method, emotion recognition device and emotion recognition medium |
CN115240649A (en) * | 2022-07-19 | 2022-10-25 | 于振华 | Voice recognition method and system based on deep learning |
CN116561533A (en) * | 2023-07-05 | 2023-08-08 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
CN116561533B (en) * | 2023-07-05 | 2023-09-29 | 福建天晴数码有限公司 | Emotion evolution method and terminal for virtual avatar in educational element universe |
Also Published As
Publication number | Publication date |
---|---|
CN110516696B (en) | 2023-07-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110516696A (en) | Adaptive weight bimodal fusion emotion recognition method based on voice and expression | |
CN110556129B (en) | Bimodal emotion recognition model training method and bimodal emotion recognition method | |
CN106250855A (en) | Multi-modal emotion recognition method based on multiple kernel learning | |
Busso et al. | Iterative feature normalization scheme for automatic emotion detection from speech | |
Bhat et al. | Automatic assessment of sentence-level dysarthria intelligibility using BLSTM | |
CN103366618B (en) | Scene device for Chinese learning training based on artificial intelligence and virtual reality | |
Chao et al. | Multi task sequence learning for depression scale prediction from video | |
He et al. | Multimodal depression recognition with dynamic visual and audio cues | |
Tian et al. | Recognizing emotions in spoken dialogue with hierarchically fused acoustic and lexical features | |
CN108899049A (en) | Speech emotion recognition method and system based on convolutional neural networks | |
CN110147548A (en) | Emotion recognition method based on bidirectional gated recurrent unit network and novel network initialization | |
Noroozi et al. | Supervised vocal-based emotion recognition using multiclass support vector machine, random forests, and adaboost | |
CN110534133A (en) | Speech emotion recognition system and method | |
Elshaer et al. | Transfer learning from sound representations for anger detection in speech | |
CN110289002A (en) | End-to-end speaker clustering method and system | |
CN109065073A (en) | Speech emotion recognition method based on deep SVM network model | |
CN115393933A (en) | Video face emotion recognition method based on frame attention mechanism | |
Huang et al. | Speech emotion recognition using convolutional neural network with audio word-based embedding | |
CN114898779A (en) | Multi-mode fused speech emotion recognition method and system | |
Jiang et al. | Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit. | |
Rajarajeswari et al. | An executable method for an intelligent speech and call recognition system using a machine learning-based approach | |
CN111090726A (en) | NLP-based text customer service interaction method for the electric power industry | |
Ling | An acoustic model for English speech recognition based on deep learning | |
CN116434786A (en) | Text-semantic-assisted teacher voice emotion recognition method | |
Lan et al. | Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||