CN110619301A - Emotion automatic identification method based on bimodal signals - Google Patents

Emotion automatic identification method based on bimodal signals

Info

Publication number
CN110619301A
Authority
CN
China
Prior art keywords
signal
facial expression
extracting
lbp
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910868310.8A
Other languages
Chinese (zh)
Other versions
CN110619301B (en)
Inventor
王峰
牛锦
魏祥
宋剑桥
相虎生
王飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dao An Bang Tianjin Security Technology Co Ltd
Original Assignee
Dao An Bang Tianjin Security Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dao An Bang Tianjin Security Technology Co Ltd filed Critical Dao An Bang Tianjin Security Technology Co Ltd
Priority to CN201910868310.8A priority Critical patent/CN110619301B/en
Publication of CN110619301A publication Critical patent/CN110619301A/en
Application granted granted Critical
Publication of CN110619301B publication Critical patent/CN110619301B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/08Feature extraction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)

Abstract

The invention discloses an automatic emotion recognition method based on bimodal signals, which comprises: cutting and framing video data containing expression actions and extracting a facial expression picture sequence; extracting LBP-TOP features of the facial expression picture sequence; extracting a pulse wave signal from the facial expression picture sequence based on a chrominance model and extracting time-domain and frequency-domain features of the pulse wave signal; fusing the extracted LBP-TOP features with the time-domain and frequency-domain features of the pulse wave signal; and dividing the fused features into a training set and a testing set, inputting the training set into a support vector machine for training and optimization, and then using the trained support vector machine to realize automatic emotion recognition from the facial expression pictures. The invention greatly reduces the complexity of the system and improves its convenience, and the fused features avoid the low recognition accuracy caused by deliberate masking of emotion or the absence of obvious facial expression change.

Description

Emotion automatic identification method based on bimodal signals
Technical Field
The invention relates to the technical field of image processing, in particular to an automatic emotion recognition method based on a bimodal signal.
Background
Emotion recognition technology is becoming mature with the advancement of instruments and equipment and the development of artificial intelligence, and is widely applied in fields such as clinical medicine, emotional intelligence, national security and political psychology. Existing emotion recognition methods fall mainly into two categories. The first is physical-sign detection based on precision instruments, in which emotion recognition and classification are achieved by measuring physiological signals such as the human electroencephalogram, electrocardiogram and pulse wave. The second is intelligent facial emotion recognition based on machine learning, which mainly captures the motion of the facial muscles, for example the corners of the mouth rising when a person is happy.
However, both methods have advantages and disadvantages. The first usually requires expensive and complex equipment; although its detection results are highly accurate, the cost is high, the approach is contact-based, and data acquisition is laborious and easily causes discomfort to the subject. It is therefore limited in practical application, unsuitable for large-scale deployment, and mostly used in special scenarios such as emotion monitoring of astronauts or of soldiers after major rescue missions. The second is the recognition and detection means commonly used at present; it requires no expensive equipment and is simple to operate. However, the accuracy of the recognized emotion cannot be guaranteed: although facial expressions visually display emotional change, many internal emotional processes are not accompanied by visible facial activity, and people can mask and hide their emotional experience, so that observers misinterpret the meaning of an expression and recognition accuracy suffers.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an automatic emotion recognition method based on bimodal signals, aiming at the above-mentioned defects in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows: the method for automatically recognizing the emotion based on the bimodal signals comprises the following steps:
step one: cutting and framing video data containing expression actions, extracting the facial expression picture sequence from the onset to the end of the expression, and preprocessing the extracted facial expression picture sequence, wherein the preprocessing at least comprises geometric correction and normalization;
step two: extracting LBP-TOP characteristics of the facial expression picture sequence;
step three: extracting pulse wave signals of the facial expression picture sequence based on the chrominance model, and extracting time domain and frequency domain characteristics of the pulse wave signals;
step four: fusing the LBP-TOP characteristics of the extracted facial expression picture sequence with the time domain and frequency domain characteristics of the pulse wave signal;
step five: dividing the fused feature data into a training set and a testing set, inputting the training set into a support vector machine for training, performing test optimization with the testing set after training is finished, and realizing automatic emotion recognition in the facial expression pictures with the support vector machine after training and optimization.
Further, the step of extracting the LBP-TOP characteristics of the facial expression picture sequence is as follows:
A. converting the normalized sequence of the facial expression pictures into a gray scale image;
B. setting LBP-TOP extraction parameters, including selection of the number of face blocks, selection of radii of an X axis, a Y axis and a T axis, the number of adjacent points p and selection of an LBP mode;
C. calculating LBP values of an XY plane, an XT plane and a YT plane respectively, and connecting the LBP values of the three planes in series to obtain an LBP-TOP characteristic, wherein the calculation formula of the LBP value of each plane is as follows:
LBP(x_c, y_c) = Σ_{i=0}^{p-1} s(g_i - g_c)·2^i,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,
wherein (x_c, y_c) is the position of the centre pixel, g_i is the gray value of the i-th neighbouring pixel, g_c is the gray value of the centre pixel, and p is the number of neighbouring points of the pixel.
Further, pulse wave signals of the facial expression picture sequence are extracted based on the chromaticity model, and time domain and frequency domain features of the pulse wave signals are extracted, and the method specifically comprises the following steps:
A. performing frame-by-frame face detection on the cut video data containing expression motions using a detection method that combines AdaBoost and Cascade classifiers; selecting the face area with the eyes and mouth excluded as the region of interest, so that blinking and mouth motion do not affect pulse-signal extraction, while enlarging the region of interest as far as possible on the premise that it remains a pure skin area;
B. pulse wave signal extraction based on the chrominance model eliminates the static component, motion interference and diffuse-reflection interference by using differences and ratios between the information of the different colour channels; the change of skin-reflected light intensity caused by the blood-volume change due to the pulse is reflected as a change of brightness information in the acquired facial expression picture sequence. The brightness is obtained by averaging the gray values of all pixels, and the brightness information of each colour channel C ∈ {R, G, B} of the image is represented as:
C(n) = (1 / (h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} c_n(i, j)
wherein n is the image index, N is the number of pictures, C(n) is the one-dimensional signal of the corresponding R, G or B channel over the region of interest, c_n(i, j) is the gray value of that channel at pixel (i, j), and h and w respectively denote the height and width of the region of interest;
for each frame image, the change of luminance information of each color channel C ∈ { R, G, B } is expressed as:
C_i = I_{c,i}·(ρ_{c,dc} + ρ_{c,i}) + s_i
wherein the subscript i denotes the current frame number, I_{c,i} represents the illumination intensity during the exposure time of the camera, ρ_{c,dc} represents the coefficient of the static component of the light reflected by the skin surface, ρ_{c,i} represents the dynamic component of the reflected light caused by the blood-volume change due to the pulse beat, and s_i denotes the additive specular reflection component, which is identical for the R, G and B channels;
normalizing each colour-channel information sequence over a period of time to eliminate the dependence on the illumination intensity I_{c,i}; the specific formula is:
C_1(n) = C(n) / μ_C,  with μ_C = (1/N)·Σ_{n=1}^{N} C(n)
wherein C(n) represents the R, G, B channel information over the period, n denotes the image index within the current period, N is the total number of images, and μ_C is the mean of the brightness information of colour channel C(n) over the current period;
defining the chrominance signals:
X_s = 2R_1(n) - 3G_1(n)
Y_s = 1.5R_1(n) + G_1(n) - 1.5B_1(n)
wherein R_1(n), G_1(n), B_1(n) are the normalized colour channel signals;
passing X_s and Y_s through a band-pass filter (0.7 Hz-4 Hz) to obtain X_f and Y_f, and extracting the pulse wave signal S by the following equation:
S = X_f - αY_f,  with α = σ(X_f) / σ(Y_f)
where σ(·) represents the standard deviation of a signal; this eliminates the interference of diffuse reflection and the static component;
extracting time-domain features from the pulse wave, including the mean, the standard deviation, the mean absolute value of the first-order difference signal, the mean absolute value of the second-order difference signal and the mean absolute value of the normalized difference signal; performing five-point moving-average filtering on the obtained pulse wave and removing abnormal beats, then detecting the dominant wave peaks of the waveform and computing the time intervals between adjacent dominant peaks, i.e. the P-P intervals; removing abnormal intervals shorter than 50 ms, plotting the normal P-P intervals to obtain the pulse-variability signal, and extracting its mean and standard deviation; counting the number of adjacent P-P intervals that differ by more than 50 ms, computing their percentage, and computing the root mean square of the successive P-P interval differences;
extracting frequency-domain features of the pulse wave: the original signal (0.7 Hz-4 Hz) is divided into 6 non-overlapping sub-bands using a 1024-point fast Fourier transform, and the power spectral entropy of each sub-band is computed as:
H_k = - Σ_i p_k(ω_i)·ln p_k(ω_i)
where p_k(ω_i) is the normalized power spectral density of the k-th sub-band. The first three of the 6 sub-bands are taken as the low-frequency band and the last three as the high-frequency band, and the ratio of the power spectral entropies of the high- and low-frequency bands is computed. Cubic-spline interpolation is applied to the pulse-variability signal to refine the pulse-wave peak points, the signal mean is removed to retain the instantaneous characteristics, and the frequency-domain characteristics of the pulse-variability signal are analysed by Fourier transform; the very-low-frequency power is computed as:
P = ∫_{f_1}^{f_2} PSD(f) df
where PSD(f) is the signal power spectral density and f_1 and f_2 are respectively the lower and upper cut-off frequencies of the band; the low-frequency power, high-frequency power, total power, ratio of low-frequency to high-frequency power, ratio of low-frequency power to total power and ratio of high-frequency power to total power are obtained in the same way.
Further, the step of fusing the extracted LBP-TOP characteristics of the facial expression picture sequence with the time domain and frequency domain characteristics of the pulse wave signal is as follows:
fusing the LBP-TOP features with the time- and frequency-domain features of the physiological signal through canonical correlation analysis, obtaining new features that contain both the expression signal and the physiological signal;
for the sample sets X and Y, the CCA algorithm finds corresponding basis vectors w_x ∈ R^q and w_y ∈ R^p such that the correlation between the projected variables X* = w_x^T X and Y* = w_y^T Y is maximal; the algorithm is formulated as maximizing the correlation coefficient:
ρ = (w_x^T Σ_12 w_y) / sqrt((w_x^T Σ_11 w_x)·(w_y^T Σ_22 w_y))
wherein Σ_11 is the covariance matrix of X, Σ_22 is the covariance matrix of Y, Σ_12 = cov(X, Y), and Σ_21 is the transpose of Σ_12; solving this problem yields w_x and w_y, and the projected variables X* and Y* serve as the combined features after projection, realizing the fusion of the two types of features.
Further, inputting the training set into a support vector machine for training, and after the training is finished, performing test optimization through the test set, wherein the test optimization steps comprise:
selecting the radial basis function (RBF) kernel, which maps samples nonlinearly with relatively low numerical complexity; the kernel function is:
K(x_i, x_j) = exp(-γ·||x_i - x_j||²)
wherein γ > 0, the default value is 1/k, and k is the number of categories;
determining two parameters, the penalty factor C and the number of cross-validation folds; the choice of C has an important influence on classification accuracy: the larger C is, the heavier the penalty on errors, but too large a value of C causes overfitting, so C must be chosen appropriately;
training the support vector machine with the training-set data and computing the recognition rate on the test set; training ends when the recognition result meets the expected requirement, otherwise the penalty factor C is optimized and training continues until the expected performance is reached.
The method for automatically identifying the emotion based on the bimodal signals has the following beneficial effects:
compared with the traditional emotion recognition technology needing wearing equipment, the method only needs to record videos containing different emotions;
compared with the traditional method with a single signal source, the method comprehensively utilizes the facial expression signals and the pulse signals, realizes emotion recognition based on multi-source information characteristic fusion, and avoids the problem of low recognition precision caused by artificial deliberate emotion masking or no obvious facial expression change;
compared with the inconvenience of traditional pulse signal acquisition and feature extraction, the pulse wave signal and the features thereof are acquired in a non-contact manner, so that the complexity of the system is greatly reduced and the convenience of the system is improved;
the LBP-TOP expression features and the pulse signal features are fused based on canonical correlation analysis (CCA), and a support vector machine is trained to realize the final classification.
Drawings
Fig. 1 is a schematic flow chart of an emotion automatic identification method based on a bimodal signal provided by the present invention.
Fig. 2 is a schematic diagram of an LBP-TOP expression feature extraction process of an emotion automatic identification method based on a bimodal signal provided by the invention.
Fig. 3 is a schematic flow chart of extracting a pulse signal based on a chromaticity model in the automatic emotion recognition method based on a bimodal signal provided by the invention.
FIG. 4 is a schematic flow chart of feature fusion and classification of an emotion automatic identification method based on bimodal signals provided by the present invention.
FIG. 5 is a schematic flow chart of the support vector machine classification recognition model establishment of the automatic emotion recognition method based on bimodal signals.
Detailed Description
For a clearer understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
As shown in fig. 1, the method for automatically recognizing emotion based on bimodal signals provided by the present invention comprises the following steps:
step one: cutting and framing video data containing expression actions, extracting the facial expression picture sequence from the onset to the end of the expression, and preprocessing the extracted facial expression picture sequence, wherein the preprocessing at least comprises geometric correction and normalization;
step two: extracting LBP-TOP characteristics of the facial expression picture sequence;
step three: extracting pulse wave signals of the facial expression picture sequence based on the chrominance model, and extracting time domain and frequency domain characteristics of the pulse wave signals;
step four: fusing the LBP-TOP characteristics of the extracted facial expression picture sequence with the time domain and frequency domain characteristics of the pulse wave signal;
step five: dividing the fused feature data into a training set and a testing set, inputting the training set into a support vector machine for training, performing test optimization with the testing set after training is finished, and realizing automatic emotion recognition in the facial expression pictures with the support vector machine after training and optimization.
The steps of extracting the LBP-TOP features of the facial expression picture sequence are as follows:
converting the normalized sequence of the facial expression pictures into a gray scale image;
setting LBP-TOP extraction parameters, including selection of the number of face blocks, selection of radii of an X axis, a Y axis and a T axis, the number of adjacent points p and selection of an LBP mode;
calculating LBP values of an XY plane, an XT plane and a YT plane respectively, and connecting the LBP values of the three planes in series to obtain an LBP-TOP characteristic, wherein the calculation formula of the LBP value of each plane is as follows:
LBP(x_c, y_c) = Σ_{i=0}^{p-1} s(g_i - g_c)·2^i,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,
wherein (x_c, y_c) is the position of the centre pixel, g_i is the gray value of the i-th neighbouring pixel, g_c is the gray value of the centre pixel, and p is the number of neighbouring points of the pixel.
The specific steps of extracting the pulse wave signal of the facial expression picture sequence based on the chrominance model and extracting its time-domain and frequency-domain features are as follows:
performing frame-by-frame face detection on the cut video data containing expression motions using a detection method that combines AdaBoost and Cascade classifiers; selecting the face area with the eyes and mouth excluded as the region of interest, so that blinking and mouth motion do not affect pulse-signal extraction, while enlarging the region of interest as far as possible on the premise that it remains a pure skin area;
pulse wave signal extraction based on the chrominance model eliminates the static component, motion interference and diffuse-reflection interference by using differences and ratios between the information of the different colour channels; the change of skin-reflected light intensity caused by the blood-volume change due to the pulse is reflected as a change of brightness information in the acquired facial expression picture sequence. The brightness is obtained by averaging the gray values of all pixels, and the brightness information of each colour channel C ∈ {R, G, B} of the image is represented as:
C(n) = (1 / (h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} c_n(i, j)
wherein n is the image index, N is the number of pictures, C(n) is the one-dimensional signal of the corresponding R, G or B channel over the region of interest, c_n(i, j) is the gray value of that channel at pixel (i, j), and h and w respectively denote the height and width of the region of interest;
for each frame image, the change of luminance information of each color channel C ∈ { R, G, B } is expressed as:
C_i = I_{c,i}·(ρ_{c,dc} + ρ_{c,i}) + s_i
wherein the subscript i denotes the current frame number, I_{c,i} represents the illumination intensity during the exposure time of the camera, ρ_{c,dc} represents the coefficient of the static component of the light reflected by the skin surface, ρ_{c,i} represents the dynamic component of the reflected light caused by the blood-volume change due to the pulse beat, and s_i denotes the additive specular reflection component, which is identical for the R, G and B channels;
normalizing each colour-channel information sequence over a period of time to eliminate the dependence on the illumination intensity I_{c,i}; the specific formula is:
C_1(n) = C(n) / μ_C,  with μ_C = (1/N)·Σ_{n=1}^{N} C(n)
wherein C(n) represents the R, G, B channel information over the period, n denotes the image index within the current period, N is the total number of images, and μ_C is the mean of the brightness information of colour channel C(n) over the current period;
defining the chrominance signals:
X_s = 2R_1(n) - 3G_1(n)
Y_s = 1.5R_1(n) + G_1(n) - 1.5B_1(n)
wherein R_1(n), G_1(n), B_1(n) are the normalized colour channel signals;
passing X_s and Y_s through a band-pass filter (0.7 Hz-4 Hz) to obtain X_f and Y_f, and extracting the pulse wave signal S by the following equation:
S = X_f - αY_f,  with α = σ(X_f) / σ(Y_f)
where σ(·) represents the standard deviation of a signal; this eliminates the interference of diffuse reflection and the static component;
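By way of illustration, the chrominance-based pulse extraction described above can be sketched in Python roughly as follows; the per-frame averaging over the region of interest is assumed to have been done already, and the function name, filter order and the use of NumPy/SciPy are illustrative choices not fixed by this description:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def chrom_pulse(rgb_means, fs):
    """Sketch of the chrominance-based pulse extraction described above.

    rgb_means : (N, 3) array of spatially averaged R, G, B values of the
                skin region of interest, one row per frame.
    fs        : video frame rate in Hz.
    """
    rgb = np.asarray(rgb_means, dtype=float)

    # Normalize each colour channel by its temporal mean (C_1(n) = C(n) / mu_C).
    norm = rgb / rgb.mean(axis=0)
    r1, g1, b1 = norm[:, 0], norm[:, 1], norm[:, 2]

    # Chrominance signals as defined in the description above.
    xs = 2.0 * r1 - 3.0 * g1
    ys = 1.5 * r1 + g1 - 1.5 * b1

    # Band-pass filter to the pulse band 0.7-4 Hz.
    b, a = butter(3, [0.7, 4.0], btype="bandpass", fs=fs)
    xf = filtfilt(b, a, xs)
    yf = filtfilt(b, a, ys)

    # Combine with the standard-deviation ratio alpha = sigma(Xf) / sigma(Yf).
    alpha = np.std(xf) / np.std(yf)
    return xf - alpha * yf
```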
extracting time-domain features from the pulse wave, including the mean, the standard deviation, the mean absolute value of the first-order difference signal, the mean absolute value of the second-order difference signal and the mean absolute value of the normalized difference signal; performing five-point moving-average filtering on the obtained pulse wave and removing abnormal beats, then detecting the dominant wave peaks of the waveform and computing the time intervals between adjacent dominant peaks, i.e. the P-P intervals; removing abnormal intervals shorter than 50 ms, plotting the normal P-P intervals to obtain the pulse-variability signal, and extracting its mean and standard deviation; counting the number of adjacent P-P intervals that differ by more than 50 ms, computing their percentage, and computing the root mean square of the successive P-P interval differences;
extracting frequency-domain features of the pulse wave: the original signal (0.7 Hz-4 Hz) is divided into 6 non-overlapping sub-bands using a 1024-point fast Fourier transform, and the power spectral entropy of each sub-band is computed as:
H_k = - Σ_i p_k(ω_i)·ln p_k(ω_i)
where p_k(ω_i) is the normalized power spectral density of the k-th sub-band. The first three of the 6 sub-bands are taken as the low-frequency band and the last three as the high-frequency band, and the ratio of the power spectral entropies of the high- and low-frequency bands is computed. Cubic-spline interpolation is applied to the pulse-variability signal to refine the pulse-wave peak points, the signal mean is removed to retain the instantaneous characteristics, and the frequency-domain characteristics of the pulse-variability signal are analysed by Fourier transform; the very-low-frequency power is computed as:
P = ∫_{f_1}^{f_2} PSD(f) df
where PSD(f) is the signal power spectral density and f_1 and f_2 are respectively the lower and upper cut-off frequencies of the band; the low-frequency power, high-frequency power, total power, ratio of low-frequency to high-frequency power, ratio of low-frequency power to total power and ratio of high-frequency power to total power are obtained in the same way.
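As a rough illustration of the pulse-wave feature extraction above, the following Python sketch computes a subset of the listed time-domain statistics and band powers; the minimum peak spacing, the low/high band boundary at 2.35 Hz and the use of Welch's method are assumptions for the sketch rather than values fixed by this description:

```python
import numpy as np
from scipy.signal import find_peaks, welch

def pulse_features(pulse, fs):
    """Sketch of a subset of the time- and frequency-domain pulse features."""
    s = np.asarray(pulse, dtype=float)

    # Time-domain statistics of the pulse wave itself.
    feats = {
        "mean": s.mean(),
        "std": s.std(),
        "abs_diff1_mean": np.abs(np.diff(s)).mean(),
        "abs_diff2_mean": np.abs(np.diff(s, n=2)).mean(),
    }

    # Dominant peaks and P-P intervals (assumed minimum peak spacing of 0.25 s).
    peaks, _ = find_peaks(s, distance=max(1, int(0.25 * fs)))
    pp = np.diff(peaks) / fs * 1000.0            # P-P intervals in milliseconds
    dpp = np.abs(np.diff(pp))                    # successive interval differences
    feats.update({
        "pp_mean": pp.mean(),
        "pp_std": pp.std(),
        "nn50": int(np.sum(dpp > 50.0)),         # intervals differing by > 50 ms
        "pnn50": float(np.mean(dpp > 50.0)),
        "rmssd": float(np.sqrt(np.mean(np.diff(pp) ** 2))),
    })

    # Band powers from the power spectral density; splitting the 0.7-4 Hz band
    # at its midpoint is only an assumed stand-in for the 6 sub-bands above.
    f, psd = welch(s, fs=fs, nperseg=min(len(s), 1024))

    def band_power(lo, hi):
        m = (f >= lo) & (f < hi)
        return float(np.trapz(psd[m], f[m]))

    lf, hf = band_power(0.7, 2.35), band_power(2.35, 4.0)
    feats.update({"lf_power": lf, "hf_power": hf, "lf_hf_ratio": lf / hf})
    return feats
```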
The steps of fusing the extracted LBP-TOP features of the facial expression picture sequence with the time-domain and frequency-domain features of the pulse wave signal are as follows:
fusing the LBP-TOP features with the time- and frequency-domain features of the physiological signal through canonical correlation analysis, obtaining new features that contain both the expression signal and the physiological signal;
for the sample sets X and Y, the CCA algorithm finds corresponding basis vectors w_x ∈ R^q and w_y ∈ R^p such that the correlation between the projected variables X* = w_x^T X and Y* = w_y^T Y is maximal; the algorithm is formulated as maximizing the correlation coefficient:
ρ = (w_x^T Σ_12 w_y) / sqrt((w_x^T Σ_11 w_x)·(w_y^T Σ_22 w_y))
wherein Σ_11 is the covariance matrix of X, Σ_22 is the covariance matrix of Y, Σ_12 = cov(X, Y), and Σ_21 is the transpose of Σ_12; solving this problem yields w_x and w_y, and the projected variables X* and Y* serve as the combined features after projection, realizing the fusion of the two types of features.
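A minimal sketch of such CCA-based feature-level fusion, using the CCA implementation in scikit-learn, might look as follows; the number of components and the concatenation of the two projections are illustrative choices, since the exact combination rule is not spelled out above:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_fuse(expr_feats, pulse_feats, n_components=10):
    """Fuse LBP-TOP expression features and pulse features with CCA.

    expr_feats  : (n_samples, q) matrix of expression features (X).
    pulse_feats : (n_samples, p) matrix of pulse features (Y).
    Returns the two projections concatenated per sample as the fused feature.
    """
    x = np.asarray(expr_feats, dtype=float)
    y = np.asarray(pulse_feats, dtype=float)
    k = min(n_components, x.shape[1], y.shape[1])

    cca = CCA(n_components=k)
    x_c, y_c = cca.fit_transform(x, y)   # projections w_x^T X and w_y^T Y
    return np.hstack([x_c, y_c])
```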
The training set is input into a support vector machine for training, and the step of testing and optimizing through the testing set after the training is finished comprises the following steps:
selecting the radial basis function (RBF) kernel, which maps samples nonlinearly with relatively low numerical complexity; the kernel function is:
K(x_i, x_j) = exp(-γ·||x_i - x_j||²)
wherein γ > 0, the default value is 1/k, and k is the number of categories;
determining two parameters, the penalty factor C and the number of cross-validation folds; the choice of C has an important influence on classification accuracy: the larger C is, the heavier the penalty on errors, but too large a value of C causes overfitting, so C must be chosen appropriately;
training the support vector machine with the training-set data and computing the recognition rate on the test set; training ends when the recognition result meets the expected requirement, otherwise the penalty factor C is optimized and training continues until the expected performance is reached.
The following case is used for specific explanation.
Because extracting the physiological signal requires video input of a certain length, the CAS(ME)² expression database of the Chinese Academy of Sciences is selected. Since the amounts of data for the different expressions in this database differ widely, 55 anger samples, 74 disgust samples, 131 happiness samples, 36 surprise samples and 21 fear samples are finally selected, 317 samples in total.
The method comprises the following steps: the method comprises the steps of cutting video data containing expression actions, enabling the cut video to be required to be unified for 10 seconds, dividing the cut video into frames, extracting expression picture sequences from the beginning to the end of expressions, and normalizing the frame numbers of all the samples to be 120 through linear interpolation, wherein the shortest sample in a CAS (ME) 2 expression database comprises 4 frames, and the longest sample comprises 118 frames. Preprocessing the extracted expression sequence such as geometric correction and normalization;
step two: extracting the LBP-TOP feature of the sequence of facial expression pictures, as shown in fig. 2, specifically includes the steps of:
(1) converting the normalized picture sequence into a gray-scale image;
(2) setting the LBP-TOP extraction parameters: the number of neighbourhood points of the LBP-TOP operator is P_XY = P_XT = P_YT = 4, the radii of the x-axis and the y-axis are R_X = R_Y = 1 with the temporal radius R_T chosen accordingly, the LBP mode is the normalized mode, and the number of blocks is 3 × 3;
(3) respectively calculating LBP values of the XY plane, the XT plane and the YT plane according to the set parameters, and connecting the LBP values of the three planes in series to obtain an LBP-TOP characteristic, wherein the calculation formula of the LBP value of each plane is as follows:
LBP(x_c, y_c) = Σ_{i=0}^{p-1} s(g_i - g_c)·2^i,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,
wherein (x_c, y_c) is the position of the centre pixel, g_i is the gray value of the i-th neighbouring pixel, g_c is the gray value of the centre pixel, and p is the number of neighbouring points of the pixel;
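For illustration, a simplified Python sketch of LBP-TOP extraction with 4 neighbouring points per plane is given below; it computes basic LBP codes on the three orthogonal planes of a single block and concatenates the normalized histograms, whereas the full method above also uses 3 × 3 blocking and the normalized LBP mode, and the default temporal radius here is only an assumption:

```python
import numpy as np

def lbp_top_histograms(volume, rx=1, ry=1, rt=2, p=4):
    """Simplified LBP-TOP descriptor of a grayscale face volume (T, H, W).

    Basic 4-neighbour LBP codes are computed on the XY, XT and YT planes and
    the three normalized histograms are concatenated (single block, basic
    codes; rt is an assumed temporal radius).
    """
    v = np.asarray(volume, dtype=float)
    t0, t1 = rt, v.shape[0] - rt
    y0, y1 = ry, v.shape[1] - ry
    x0, x1 = rx, v.shape[2] - rx
    centre = v[t0:t1, y0:y1, x0:x1]

    def codes(offsets):
        # offsets: (dt, dy, dx) neighbour displacements, one bit per neighbour
        code = np.zeros(centre.shape, dtype=np.uint8)
        for k, (dt, dy, dx) in enumerate(offsets):
            neigh = v[t0 + dt:t1 + dt, y0 + dy:y1 + dy, x0 + dx:x1 + dx]
            code |= (neigh >= centre).astype(np.uint8) << k
        return code

    xy = codes([(0, 0, rx), (0, ry, 0), (0, 0, -rx), (0, -ry, 0)])
    xt = codes([(0, 0, rx), (rt, 0, 0), (0, 0, -rx), (-rt, 0, 0)])
    yt = codes([(0, ry, 0), (rt, 0, 0), (0, -ry, 0), (-rt, 0, 0)])

    hists = [np.bincount(c.ravel(), minlength=2 ** p).astype(float)
             for c in (xy, xt, yt)]
    return np.concatenate([h / h.sum() for h in hists])
```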
step three: the pulse wave signal is extracted based on the chrominance model, and the time-domain and frequency-domain features of the pulse wave signal are extracted, as shown in fig. 3.
Step four: fusing the expression characteristics with the time domain and frequency domain characteristics of the pulse wave signals, as shown in FIG. 4;
step five: dividing all facial expression data into a training set and a testing set, processing according to the four steps, and finally classifying by using a support vector machine, as shown in fig. 5:
(1) randomly dividing the fused features into a training set and a test set, and selecting the radial basis function (RBF) kernel, which maps samples nonlinearly with relatively low numerical complexity; the kernel function is:
K(x_i, x_j) = exp(-γ·||x_i - x_j||²)
where γ > 0, the default value is 1/k, and k is the number of classes.
(2) determining two parameters, the penalty factor C and the number of cross-validation folds; the choice of C has an important influence on classification accuracy: the larger C is, the heavier the penalty on errors, but too large a value of C causes overfitting, so C must be chosen appropriately.
(3) training the support vector machine with the training-set data and computing the recognition rate on the test set; training ends when the recognition result meets the expected requirement, otherwise the penalty factor C is optimized and training continues until the expected performance is reached.
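A hedged Python sketch of this training and tuning loop, using scikit-learn's SVC with an RBF kernel and a grid search over the penalty factor C, is given below; the split ratio, the C grid and the 5-fold cross-validation are illustrative assumptions, and gamma='auto' (1 divided by the number of features) stands in for the 1/k default mentioned above:

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_emotion_svm(features, labels):
    """Train and tune an RBF-kernel SVM on the fused features (sketch)."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, stratify=labels, random_state=0)

    # RBF-kernel SVM; gamma='auto' uses 1 / n_features as the default.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))

    # Grid-search the penalty factor C with 5-fold cross-validation.
    search = GridSearchCV(model,
                          param_grid={"svc__C": [0.1, 1.0, 10.0, 100.0]},
                          cv=5)
    search.fit(x_tr, y_tr)

    # Recognition rate on the held-out test set.
    accuracy = search.score(x_te, y_te)
    return search.best_estimator_, accuracy
```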
The method for automatically identifying the emotion based on the bimodal signals has the following beneficial effects:
compared with the traditional emotion recognition technology needing wearing equipment, the method only needs to record videos containing different emotions;
compared with the traditional method with a single signal source, the method comprehensively utilizes the facial expression signals and the pulse signals, realizes emotion recognition based on multi-source information characteristic fusion, and avoids the problem of low recognition precision caused by artificial deliberate emotion masking or no obvious facial expression change;
compared with the inconvenience of traditional pulse signal acquisition and feature extraction, the pulse wave signal and the features thereof are acquired in a non-contact manner, so that the complexity of the system is greatly reduced and the convenience of the system is improved;
the LBP-TOP expression features and the pulse signal features are fused based on canonical correlation analysis (CCA), and a support vector machine is trained to realize the final classification.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. An automatic emotion recognition method based on a bimodal signal is characterized by comprising the following steps:
step one: cutting and framing video data containing expression actions, extracting the facial expression picture sequence from the onset to the end of the expression, and preprocessing the extracted facial expression picture sequence, wherein the preprocessing at least comprises geometric correction and normalization;
step two: extracting LBP-TOP characteristics of the facial expression picture sequence;
step three: extracting pulse wave signals of the facial expression picture sequence based on the chrominance model, and extracting time domain and frequency domain characteristics of the pulse wave signals;
step four: fusing the LBP-TOP characteristics of the extracted facial expression picture sequence with the time domain and frequency domain characteristics of the pulse wave signal;
step five: dividing the fused feature data into a training set and a testing set, inputting the training set into a support vector machine for training, performing test optimization with the testing set after training is finished, and realizing automatic emotion recognition in the facial expression pictures with the support vector machine after training and optimization.
2. The method as claimed in claim 1, wherein the step of extracting LBP-TOP features of the facial expression picture sequence in the second step is as follows:
A. converting the normalized sequence of the facial expression pictures into a gray scale image;
B. setting LBP-TOP extraction parameters, including selection of the number of face blocks, selection of radii of an X axis, a Y axis and a T axis, the number of adjacent points p and selection of an LBP mode;
C. calculating LBP values of an XY plane, an XT plane and a YT plane respectively, and connecting the LBP values of the three planes in series to obtain an LBP-TOP characteristic, wherein the calculation formula of the LBP value of each plane is as follows:
LBP(x_c, y_c) = Σ_{i=0}^{p-1} s(g_i - g_c)·2^i,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise,
wherein (x_c, y_c) is the position of the centre pixel, g_i is the gray value of the i-th neighbouring pixel, g_c is the gray value of the centre pixel, and p is the number of neighbouring points of the pixel.
3. The method of claim 1, wherein the third step of extracting the pulse wave signal of the sequence of facial expression pictures based on the chrominance model and extracting the time domain and frequency domain features of the pulse wave signal comprises the following steps:
A. performing frame-by-frame face detection on the cut video data containing expression motions using a detection method that combines AdaBoost and Cascade classifiers; selecting the face area with the eyes and mouth excluded as the region of interest, so that blinking and mouth motion do not affect pulse-signal extraction, while enlarging the region of interest as far as possible on the premise that it remains a pure skin area;
B. pulse wave signal extraction based on the chrominance model eliminates the static component, motion interference and diffuse-reflection interference by using differences and ratios between the information of the different colour channels; the change of skin-reflected light intensity caused by the blood-volume change due to the pulse is reflected as a change of brightness information in the acquired facial expression picture sequence. The brightness is obtained by averaging the gray values of all pixels, and the brightness information of each colour channel C ∈ {R, G, B} of the image is represented as:
C(n) = (1 / (h·w)) · Σ_{i=1}^{h} Σ_{j=1}^{w} c_n(i, j)
wherein n is the image index, N is the number of pictures, C(n) is the one-dimensional signal of the corresponding R, G or B channel over the region of interest, c_n(i, j) is the gray value of that channel at pixel (i, j), and h and w respectively denote the height and width of the region of interest;
for each frame image, the change of luminance information of each color channel C ∈ { R, G, B } is expressed as:
C_i = I_{c,i}·(ρ_{c,dc} + ρ_{c,i}) + s_i
wherein the subscript i denotes the current frame number, I_{c,i} represents the illumination intensity during the exposure time of the camera, ρ_{c,dc} represents the coefficient of the static component of the light reflected by the skin surface, ρ_{c,i} represents the dynamic component of the reflected light caused by the blood-volume change due to the pulse beat, and s_i denotes the additive specular reflection component, which is identical for the R, G and B channels;
normalizing each colour-channel information sequence over a period of time to eliminate the dependence on the illumination intensity I_{c,i}; the specific formula is:
C_1(n) = C(n) / μ_C,  with μ_C = (1/N)·Σ_{n=1}^{N} C(n)
wherein C(n) represents the R, G, B channel information over the period, n denotes the image index within the current period, N is the total number of images, and μ_C is the mean of the brightness information of colour channel C(n) over the current period;
defining the chrominance signals:
X_s = 2R_1(n) - 3G_1(n)
Y_s = 1.5R_1(n) + G_1(n) - 1.5B_1(n)
wherein R_1(n), G_1(n), B_1(n) are the normalized colour channel signals;
passing X_s and Y_s through a band-pass filter (0.7 Hz-4 Hz) to obtain X_f and Y_f, and extracting the pulse wave signal S by the following equation:
S = X_f - αY_f,  with α = σ(X_f) / σ(Y_f)
where σ(·) represents the standard deviation of a signal; this eliminates the interference of diffuse reflection and the static component;
extracting time-domain features from the pulse wave, including the mean, the standard deviation, the mean absolute value of the first-order difference signal, the mean absolute value of the second-order difference signal and the mean absolute value of the normalized difference signal; performing five-point moving-average filtering on the obtained pulse wave and removing abnormal beats, then detecting the dominant wave peaks of the waveform and computing the time intervals between adjacent dominant peaks, i.e. the P-P intervals; removing abnormal intervals shorter than 50 ms, plotting the normal P-P intervals to obtain the pulse-variability signal, and extracting its mean and standard deviation; counting the number of adjacent P-P intervals that differ by more than 50 ms, computing their percentage, and computing the root mean square of the successive P-P interval differences;
extracting frequency-domain features of the pulse wave: the original signal (0.7 Hz-4 Hz) is divided into 6 non-overlapping sub-bands using a 1024-point fast Fourier transform, and the power spectral entropy of each sub-band is computed as:
H_k = - Σ_i p_k(ω_i)·ln p_k(ω_i)
where p_k(ω_i) is the normalized power spectral density of the k-th sub-band. The first three of the 6 sub-bands are taken as the low-frequency band and the last three as the high-frequency band, and the ratio of the power spectral entropies of the high- and low-frequency bands is computed. Cubic-spline interpolation is applied to the pulse-variability signal to refine the pulse-wave peak points, the signal mean is removed to retain the instantaneous characteristics, and the frequency-domain characteristics of the pulse-variability signal are analysed by Fourier transform; the very-low-frequency power is computed as:
P = ∫_{f_1}^{f_2} PSD(f) df
where PSD(f) is the signal power spectral density and f_1 and f_2 are respectively the lower and upper cut-off frequencies of the band; the low-frequency power, high-frequency power, total power, ratio of low-frequency to high-frequency power, ratio of low-frequency power to total power and ratio of high-frequency power to total power are obtained in the same way.
4. The method as claimed in claim 1, wherein the step of fusing the extracted LBP-TOP features of the sequence of facial expression pictures with the time domain and frequency domain features of the pulse wave signal comprises the following steps:
fusing the LBP-TOP features with the time- and frequency-domain features of the physiological signal through canonical correlation analysis, obtaining new features that contain both the expression signal and the physiological signal;
for the sample sets X and Y, the CCA algorithm finds corresponding basis vectors w_x ∈ R^q and w_y ∈ R^p such that the correlation between the projected variables X* = w_x^T X and Y* = w_y^T Y is maximal; the algorithm is formulated as maximizing the correlation coefficient:
ρ = (w_x^T Σ_12 w_y) / sqrt((w_x^T Σ_11 w_x)·(w_y^T Σ_22 w_y))
wherein Σ_11 is the covariance matrix of X, Σ_22 is the covariance matrix of Y, Σ_12 = cov(X, Y), and Σ_21 is the transpose of Σ_12; solving this problem yields w_x and w_y, and the projected variables X* and Y* serve as the combined features after projection, realizing the fusion of the two types of features.
5. The method of claim 1, wherein in the fifth step, the training set is input into a support vector machine for training, and after training, the step of performing test optimization through the test set comprises:
selecting the radial basis function (RBF) kernel, which maps samples nonlinearly with relatively low numerical complexity; the kernel function is:
K(x_i, x_j) = exp(-γ·||x_i - x_j||²)
wherein γ > 0, the default value is 1/k, and k is the number of categories;
determining two parameters, the penalty factor C and the number of cross-validation folds; the choice of C has an important influence on classification accuracy: the larger C is, the heavier the penalty on errors, but too large a value of C causes overfitting, so C must be chosen appropriately;
training the support vector machine with the training-set data and computing the recognition rate on the test set; training ends when the recognition result meets the expected requirement, otherwise the penalty factor C is optimized and training continues until the expected performance is reached.
CN201910868310.8A 2019-09-13 2019-09-13 Emotion automatic identification method based on bimodal signals Active CN110619301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910868310.8A CN110619301B (en) 2019-09-13 2019-09-13 Emotion automatic identification method based on bimodal signals

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910868310.8A CN110619301B (en) 2019-09-13 2019-09-13 Emotion automatic identification method based on bimodal signals

Publications (2)

Publication Number Publication Date
CN110619301A true CN110619301A (en) 2019-12-27
CN110619301B CN110619301B (en) 2023-04-18

Family

ID=68922889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910868310.8A Active CN110619301B (en) 2019-09-13 2019-09-13 Emotion automatic identification method based on bimodal signals

Country Status (1)

Country Link
CN (1) CN110619301B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130300900A1 (en) * 2012-05-08 2013-11-14 Tomas Pfister Automated Recognition Algorithm For Detecting Facial Expressions
US20170311901A1 (en) * 2016-04-18 2017-11-02 Massachusetts Institute Of Technology Extraction of features from physiological signals
CN107491740A (en) * 2017-07-28 2017-12-19 北京科技大学 A kind of neonatal pain recognition methods based on facial expression analysis
CN108216254A (en) * 2018-01-10 2018-06-29 山东大学 The road anger Emotion identification method merged based on face-image with pulse information
CN110135254A (en) * 2019-04-12 2019-08-16 华南理工大学 A kind of fatigue expression recognition method
CN110164209A (en) * 2019-04-24 2019-08-23 薄涛 Instructional terminal, server and live teaching broadcast system
CN111797747A (en) * 2020-06-28 2020-10-20 道和安邦(天津)安防科技有限公司 Potential emotion recognition method based on EEG, BVP and micro-expression

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
ANASTASSIA ANGELOPOULOU等: "Evaluation of different chrominance models in the detection and reconstruction of faces and hands using the growing neural gas network", 《PATTERN ANALYSIS AND APPLICATIONS》 *
GERARD DE HAAN等: "Robust pulse-rate from chrominance-based rPPG", 《IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING》 *
XIAOPENG HONG等: "LBP-TOP: a Tensor Unfolding Revisit", 《ACCV 2016: COMPUTER VISION – ACCV 2016 WORKSHOPS》 *
于曼丽: "Research on bimodal video emotion recognition based on expression and physiological signals", China Master's Theses Full-text Database, Information Science and Technology Series *
牛锦: "Research on video-based latent emotion recognition", China Master's Theses Full-text Database, Information Science and Technology Series *
牛锦 et al.: "Video emotion recognition using facial features and pulse signal features", Journal of Chongqing University of Technology (Natural Science) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353390A (en) * 2020-01-17 2020-06-30 道和安邦(天津)安防科技有限公司 Micro-expression recognition method based on deep learning
CN111523601A (en) * 2020-04-26 2020-08-11 道和安邦(天津)安防科技有限公司 Latent emotion recognition method based on knowledge guidance and generation counterstudy
CN111523601B (en) * 2020-04-26 2023-08-15 道和安邦(天津)安防科技有限公司 Potential emotion recognition method based on knowledge guidance and generation of countermeasure learning
CN111797747A (en) * 2020-06-28 2020-10-20 道和安邦(天津)安防科技有限公司 Potential emotion recognition method based on EEG, BVP and micro-expression
CN111797747B (en) * 2020-06-28 2023-08-18 道和安邦(天津)安防科技有限公司 Potential emotion recognition method based on EEG, BVP and micro-expression
CN112270327A (en) * 2020-10-19 2021-01-26 西安工程大学 Power transmission conductor icing classification method based on local fusion frequency domain characteristics
CN112270327B (en) * 2020-10-19 2023-03-14 西安工程大学 Power transmission conductor icing classification method based on local fusion frequency domain characteristics
CN112766112A (en) * 2021-01-08 2021-05-07 山东大学 Dynamic expression recognition method and system based on space-time multi-feature fusion
US11227161B1 (en) 2021-02-22 2022-01-18 Institute Of Automation, Chinese Academy Of Sciences Physiological signal prediction method
CN112580612A (en) * 2021-02-22 2021-03-30 中国科学院自动化研究所 Physiological signal prediction method
CN112580612B (en) * 2021-02-22 2021-06-08 中国科学院自动化研究所 Physiological signal prediction method
CN113017630A (en) * 2021-03-02 2021-06-25 贵阳像树岭科技有限公司 Visual perception emotion recognition method
CN113057633B (en) * 2021-03-26 2022-11-01 华南理工大学 Multi-modal emotional stress recognition method and device, computer equipment and storage medium
CN113057633A (en) * 2021-03-26 2021-07-02 华南理工大学 Multi-modal emotional stress recognition method and device, computer equipment and storage medium
CN113827240A (en) * 2021-09-22 2021-12-24 北京百度网讯科技有限公司 Emotion classification method and emotion classification model training method, device and equipment
CN113827240B (en) * 2021-09-22 2024-03-22 北京百度网讯科技有限公司 Emotion classification method, training device and training equipment for emotion classification model
CN114049677A (en) * 2021-12-06 2022-02-15 中南大学 Vehicle ADAS control method and system based on emotion index of driver
CN114049677B (en) * 2021-12-06 2023-08-25 中南大学 Vehicle ADAS control method and system based on driver emotion index
CN114399709A (en) * 2021-12-30 2022-04-26 北京北大医疗脑健康科技有限公司 Child emotion recognition model training method and child emotion recognition method
CN114391846A (en) * 2022-01-21 2022-04-26 中山大学 Emotion recognition method and system based on filtering type feature selection
CN114403877A (en) * 2022-01-21 2022-04-29 中山大学 Multi-physiological-signal emotion quantitative evaluation method based on two-dimensional continuous model
CN114391846B (en) * 2022-01-21 2023-12-01 中山大学 Emotion recognition method and system based on filtering type feature selection

Also Published As

Publication number Publication date
CN110619301B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110619301B (en) Emotion automatic identification method based on bimodal signals
CN109730637B (en) Quantitative analysis system and method for facial image of human face
CN110069958B (en) Electroencephalogram signal rapid identification method of dense deep convolutional neural network
CN106778695B (en) Multi-person rapid heart rate detection method based on video
JP6521845B2 (en) Device and method for measuring periodic fluctuation linked to heart beat
CN113017630B (en) Visual perception emotion recognition method
CN110991406B (en) RSVP electroencephalogram characteristic-based small target detection method and system
CN111797747B (en) Potential emotion recognition method based on EEG, BVP and micro-expression
Qureshi et al. Detection of glaucoma based on cup-to-disc ratio using fundus images
CN108596237B (en) A kind of endoscopic polyp of colon sorter of LCI laser based on color and blood vessel
CN111259895B (en) Emotion classification method and system based on facial blood flow distribution
CN111832431A (en) Emotional electroencephalogram classification method based on CNN
CN112084927A (en) Lip language identification method fusing multiple visual information
Hernandez-Ortega et al. A comparative evaluation of heart rate estimation methods using face videos
CN110473176B (en) Image processing method and device, fundus image processing method and electronic equipment
Cvejic et al. A nonreference image fusion metric based on the regional importance measure
CN110610480A (en) MCASPP neural network eyeground image optic cup optic disc segmentation model based on Attention mechanism
CN110251076B (en) Method and device for detecting significance based on contrast and fusing visual attention
CN111814738A (en) Human face recognition method, human face recognition device, computer equipment and medium based on artificial intelligence
Karmuse et al. A robust rPPG approach for continuous heart rate measurement based on face
Soltani et al. Beta wave activity analysis of EEG during mental painting reflects influence of artistic expertise
CN116453171A (en) Method and device for detecting blood vessel color in white eye area, electronic equipment and medium
CN110507288A (en) Vision based on one-dimensional convolutional neural networks induces motion sickness detection method
CN115690528A (en) Electroencephalogram signal aesthetic evaluation processing method, device, medium and terminal across main body scene
Hamel et al. Contribution of color information in visual saliency model for videos

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant