CN113033450A - Multi-mode continuous emotion recognition method, service inference method and system - Google Patents

Multi-mode continuous emotion recognition method, service inference method and system

Info

Publication number
CN113033450A
CN113033450A (application CN202110361649.6A)
Authority
CN
China
Prior art keywords
emotion
emotion recognition
service
voice
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110361649.6A
Other languages
Chinese (zh)
Other versions
CN113033450B (en)
Inventor
路飞
张龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202110361649.6A priority Critical patent/CN113033450B/en
Publication of CN113033450A publication Critical patent/CN113033450A/en
Application granted granted Critical
Publication of CN113033450B publication Critical patent/CN113033450B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V40/168 Feature extraction; Face representation
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/24155 Bayesian classification
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N5/04 Inference or reasoning models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V40/172 Classification, e.g. identification
    • G06V40/174 Facial expression recognition
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/63 Speech or voice analysis techniques specially adapted for comparison or discrimination, for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Algebra (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal continuous emotion recognition method, a service inference method and a system. The method comprises the following steps: acquiring video data containing the facial expressions and voice of a user; extracting face images from the video image sequence and extracting features from them to obtain expression emotion features; performing continuous emotion recognition according to the expression emotion features; for the voice data, acquiring voice emotion features and performing continuous emotion recognition according to them; and fusing the expression emotion recognition result with the voice emotion recognition result. The fusion overcomes the shortcomings of a single modality in continuous emotion recognition and improves recognition accuracy. On this basis, service reasoning is carried out with a multi-entity Bayesian network model, so that the service robot can adjust its services dynamically according to the user's emotion.

Description

Multi-mode continuous emotion recognition method, service inference method and system
Technical Field
The invention belongs to the technical field of service robots, and particularly relates to a multi-mode continuous emotion recognition method, a service inference method and a system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
As service robots play an increasingly important role in home scenarios, natural human-robot interaction becomes one of the key factors affecting user satisfaction and the comfort of human-robot coexistence. The goal of the home service robot is to perceive the user's emotion and to provide high-quality service according to the user's emotional state.
To the inventor's knowledge, human emotion recognition currently relies mainly on discrete emotion models. Because the expression of human emotion is a complex, continuous process, a discrete emotion model can hardly express the user's emotional state fully, so recognizing the user's continuous emotional state is quite necessary. At the same time, annotating continuous emotional states is complex and data sets are scarce, and single-modality continuous emotion recognition suffers from low recognition accuracy and poor robustness. Therefore, in order to reduce the influence of scarce data sets, improve the accuracy of emotion recognition and strengthen the robustness of the recognition system, the complementarity between modalities needs to be explored to realize multi-modal fused emotion recognition and improve the final recognition quality.
The home service robot serves humans, yet the services provided by current service robots rarely consider the user's current emotional state. Their inference rules are rigid and ignore the dynamic changes of the home environment and its many uncertain factors, so the inferred service result often cannot serve the user well and does not reflect the intelligence expected of a service robot.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-modal continuous emotion recognition method, a service inference method and a system. Combining expression-based and voice-based emotion recognition results overcomes the shortcomings of a single modality in continuous emotion recognition and improves recognition accuracy; on this basis, service inference based on a multi-entity Bayesian network model allows the service robot to adjust its services dynamically according to the user's emotion.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
a multi-modal continuous emotion recognition method based on expressions and voice comprises the following steps:
acquiring video data containing facial expressions and voices of a user;
extracting a face image from the video image sequence, and extracting the characteristics of the face image to obtain expression and emotion characteristics;
according to the expression emotion characteristics, continuous emotion recognition is carried out based on a pre-trained deep learning model;
for voice data, acquiring voice emotion characteristics by utilizing a Mel frequency cepstrum coefficient;
according to the voice emotion characteristics, continuous emotion recognition is carried out on the basis of a pre-trained transfer learning network;
and fusing the expression emotion recognition result and the voice emotion recognition result to obtain a final recognition result.
Further, feature extraction is performed on the face image with the Gabor wavelet transform to obtain the expression emotion features.
Further, extracting the face image includes:
adopting a pre-trained neural network model to perform face recognition on a video image sequence, simultaneously recognizing abnormal video frames, and discarding the abnormal frames; wherein the neural network model cascades three convolutional neural networks of different depths.
Further, the voice data is also preprocessed:
processing voice data by utilizing a first-order non-recursive high-pass filter;
the speech data is subjected to framing processing, and a hamming window is added to realize smooth transition between two adjacent frames.
Further, the transfer learning network comprises, in order from the input end to the output end: a first convolution layer, a pooling layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a dropout layer and a fully connected layer, wherein the fully connected layer adopts a Tanh activation function.
Furthermore, the expression emotion recognition result and the voice emotion recognition result are fused by adopting multiple linear regression.
One or more embodiments provide a multimodal continuous emotion recognition system based on expression and speech, comprising:
a data acquisition module configured to acquire video data containing user facial expressions and voices;
the expression emotion recognition module is configured to extract a face image from the video image sequence and perform feature extraction on the face image to obtain expression emotion features; according to the expression emotion characteristics, continuous emotion recognition is carried out based on a pre-trained deep learning model;
the voice emotion recognition module is configured to acquire voice emotion characteristics by utilizing a Mel frequency cepstrum coefficient for voice data; according to the voice emotion characteristics, continuous emotion recognition is carried out on the basis of a pre-trained transfer learning network;
and the data fusion module is configured to fuse the expression emotion recognition result and the voice emotion recognition result to obtain a final recognition result.
One or more embodiments provide a robot service inference method including the steps of:
acquiring current environment information and user emotion information; wherein, the user emotion information is acquired by adopting the method;
reasoning to obtain a service task based on a pre-constructed multi-entity Bayesian network model; the multi-entity Bayesian network model comprises entity fragments corresponding to the environment variables and the user emotion variables, and relations among the environment variables, the user emotion variables and the service tasks.
One or more embodiments provide a robotic service inference system, comprising:
the data acquisition module is configured to acquire current environment information and user emotion information; wherein, the user emotion information is acquired by adopting the method;
the service inference module is configured to infer a service task based on a pre-constructed multi-entity Bayesian network model; the multi-entity Bayesian network model comprises entity fragments corresponding to the environment variables and the user emotion variables, and relations among the environment variables, the user emotion variables and the service tasks.
One or more embodiments provide a service robot configured to perform the multi-modal continuous emotion recognition method based on expressions and voices, or the robot service inference method.
One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the multi-modal continuous emotion recognition method based on expressions and voice, or the robot service inference method, as described.
The above one or more technical solutions have the following beneficial effects:
by adopting a mode of combining expression and voice multi-mode emotion recognition results, the defects of single mode in continuous emotion recognition are overcome, and the emotion recognition precision is improved.
When emotion recognition based on expressions is carried out, the convolutional neural network with a cascaded architecture detects faces in the expression video frames and discards abnormal frames, realizing coarse-to-fine face detection with higher accuracy; feature extraction with the Gabor transform augments the expression emotion feature data, which alleviates the low recognition accuracy caused by scarce data sets and helps to improve the accuracy of continuous emotion recognition.
When emotion recognition based on voice is carried out, the voice emotion features are obtained with Mel-frequency cepstral coefficients and emotion recognition is performed with the constructed transfer learning network; adopting a Tanh activation function in the fully connected layer better addresses the low recognition accuracy caused by scarce data sets.
After the user's continuous emotional state is obtained, the multi-entity Bayesian network model enables uncertainty reasoning about the service, so the service is adjusted dynamically according to the user's emotion and user satisfaction is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flow diagram of a method for multi-modal continuous emotion recognition based on expression and speech in one or more embodiments of the invention;
FIG. 2 is an Arousal-Valence continuous emotion space model;
FIG. 3 is a schematic diagram illustrating facial image cropping rules in one or more embodiments of the present disclosure;
FIG. 4 is a schematic diagram of a migratory neural network architecture in accordance with one or more embodiments of the present invention;
FIG. 5 is a diagram illustrating emotion recognition results for various steps in one or more embodiments of the invention;
FIG. 6 is a flow diagram of a multi-entity Bayesian network based service recommendation in one or more embodiments of the invention;
FIG. 7 is a constructed family scene multi-entity Bayesian network model;
FIG. 8 shows the service robot's service task inference result under the user's emotional state at time t1;
FIG. 9 shows the service robot's service task inference result under the user's emotional state at time t2.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
The embodiment discloses a multi-modal continuous emotion recognition method based on expressions and voice, which comprises the following steps as shown in fig. 1:
step 1: acquiring video data containing facial expressions and voices of a user;
Experimental validation was carried out on the AVEC2013 dataset chosen for this example. The AVEC2013 database is a public data set provided for the third Audio/Visual Emotion Challenge; it contains facial expression and speech emotion data and carries emotion labels in the two continuous dimensions Arousal and Valence shown in fig. 2.
Step 2: extracting a face image based on a pre-trained face recognition model, and carrying out emotion recognition; the method specifically comprises the following steps:
step 2.1: the method comprises the steps of utilizing a convolutional neural network based on a cascade framework to realize face detection and discard of abnormal frames in expression video frames, and extracting face images;
Firstly, face detection and automatic discarding of abnormal video frames are realized by combining the Caffe deep learning framework, OpenCV2 and a cascade of three convolutional neural networks of different depths; cascading three networks of different depths achieves coarse-to-fine detection. The obtained face video frames are then subjected to face alignment, face region cropping, graying and gray-level equalization. The cropping rule of the face region is shown in fig. 3: with the eye distance denoted d, the upper boundary lies 0.8d above the line connecting the centers of the two eyes, the lower boundary lies 1.6d below it, and the left and right boundaries are obtained by extending a distance d to each side of the midpoint of that line. Gray-level equalization uses the equalizeHist function in OpenCV2, and the processed image size becomes 120 × 120 pixels.
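As an illustration only (not taken from the patent's implementation), the following Python sketch applies the cropping rule and gray-level equalization described above with OpenCV; the eye coordinates are assumed to be supplied by an upstream face/landmark detector, and the function name is hypothetical.

import cv2
import numpy as np

def crop_face(frame, left_eye, right_eye):
    # Cropping rule from fig. 3: 0.8*d above the eye line, 1.6*d below it,
    # and d to each side of the eye-line midpoint, where d is the eye distance.
    (lx, ly), (rx, ry) = left_eye, right_eye
    d = float(np.hypot(rx - lx, ry - ly))            # inter-ocular distance d
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0        # midpoint of the eye line
    x0, x1 = int(cx - d), int(cx + d)                # left / right boundaries
    y0, y1 = int(cy - 0.8 * d), int(cy + 1.6 * d)    # upper / lower boundaries
    roi = frame[max(y0, 0):y1, max(x0, 0):x1]
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)     # graying
    eq = cv2.equalizeHist(gray)                      # gray-level equalization
    return cv2.resize(eq, (120, 120))                # 120 x 120 pixels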
Step 2.2: obtaining data increment and expression emotional characteristics through a Gabor conversion processing technology;
The emotion features of the processed face images are extracted with the Gabor wavelet transform, which also augments the data. The Gabor wavelet transform is defined as:

ψ_{u,v}(z) = (||k_{u,v}||²/σ²) · exp(-||k_{u,v}||²·||z||²/(2σ²)) · [exp(i·k_{u,v}·z) - exp(-σ²/2)]   (1)

k_{u,v} = k_v · e^{iφ_u}   (2)

k_v = k_max/f^v,  φ_u = πu/K   (3)

where φ_u represents the directional selectivity of the filter, z is a given image coordinate vector, σ determines the ratio of the width of the Gaussian window to the wave vector length and is set to 2π in this embodiment, k_v represents the wavelength of the filter, and φ_u and k_v jointly determine the direction and scale of the wave vector k_{u,v}. K denotes the total number of directions, f is the sampling step in the frequency domain, and k_max is the maximum sampling frequency. u indexes the filter direction (the number of directions is set to 8) and v indexes the filter scale (the number of scales is set to 5), giving a bank of 40 Gabor filters in total. i indicates that the function is complex-valued.
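For reference, a minimal NumPy sketch of the 40-filter Gabor bank defined by equations (1)-(3) could look as follows; the kernel size and the values of k_max and f are assumptions (common choices), since the embodiment only fixes σ = 2π, 8 directions and 5 scales.

import numpy as np

def gabor_kernel(u, v, size=39, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2), K=8):
    k_v = k_max / (f ** v)                                 # filter wavelength, Eq. (3)
    phi_u = np.pi * u / K                                  # filter direction, Eq. (3)
    kx, ky = k_v * np.cos(phi_u), k_v * np.sin(phi_u)      # wave vector k_{u,v}, Eq. (2)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]        # image coordinates z
    k_sq = kx ** 2 + ky ** 2
    envelope = (k_sq / sigma ** 2) * np.exp(-k_sq * (x ** 2 + y ** 2) / (2 * sigma ** 2))
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)   # DC-compensated, Eq. (1)
    return envelope * carrier

bank = [gabor_kernel(u, v) for v in range(5) for u in range(8)]          # 5 scales x 8 directions = 40 filters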
Step 2.3: realizing continuous emotion recognition of the expression by combining a deep learning method;
Emotion recognition of facial expressions is performed using the ResNet50 network in TensorFlow and an MSI RTX 3090 graphics card.
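A possible TensorFlow/Keras setup for this expression branch is sketched below; the input shape, regression head and loss are assumptions, since the embodiment only names ResNet50 and the two continuous Arousal/Valence dimensions.

import tensorflow as tf

backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(120, 120, 3), pooling="avg")
expression_model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(2, activation="tanh"),   # continuous Arousal and Valence in [-1, 1]
])
expression_model.compile(optimizer="adam", loss="mse")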
Step 3: performing emotion recognition based on the voice data; the method specifically comprises the following steps:
step 3.1: the original speech signal is first preprocessed using MATLAB 2018a, the preprocessing process is detailed as follows:
the speech is first pre-emphasized using a first-order non-recursive high-pass filter to boost the high-frequency part. The mathematical formula of the high-pass filter is as follows:
H(z) = 1 - μz⁻¹,  0.9 < μ < 1   (4)
where μ is the pre-emphasis coefficient; this method selects μ = 0.97.
The speech signal is then divided into frames, each frame lasting 20-30 ms.
A Hamming window is then applied to each frame to achieve a smooth transition between two adjacent frames.
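A hedged Python sketch of this pre-processing (pre-emphasis with μ = 0.97, framing, Hamming windowing) is given below; the 25 ms frame length and 10 ms hop are assumptions within the 20-30 ms range stated above, and the embodiment itself uses MATLAB 2018a rather than Python.

import numpy as np

def preprocess_speech(signal, sr, mu=0.97, frame_ms=25, hop_ms=10):
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])   # H(z) = 1 - mu*z^-1, Eq. (4)
    frame_len, hop_len = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len            # assumes signal >= one frame
    window = np.hamming(frame_len)                                     # smooth inter-frame transition
    return np.stack([
        emphasized[i * hop_len:i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])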
Step 3.2: acquiring speech emotion characteristics by utilizing a Mel frequency cepstrum coefficient;
(1) the speech signal is transformed from a time domain signal to a frequency domain signal using a fast fourier transform.
(2) The spectrum obtained above is smoothed with a Mel filter bank, which eliminates the influence of harmonics, highlights the formants of the original speech and reduces the amount of computation. The Mel filter bank is expressed as:
H_m(k) = 0                                for k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1))   for f(m-1) ≤ k ≤ f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m))   for f(m) ≤ k ≤ f(m+1)
H_m(k) = 0                                for k > f(m+1)          (5)

where f(m) is the center frequency of the m-th Mel filter, m = 0, 1, 2, 3, …, M, with M set to 23; k denotes frequency, and Σ_m H_m(k) = 1.
(3) The speech signal is subjected to dimensionality reduction compression using a discrete cosine transform method.
Because the speech signal is dynamic in the time domain, first-order and second-order differences of the static features are computed to obtain the dynamic characteristics of the speech, so that the emotion features reflect the continuity of the signal and add information between adjacent frames.
(4) Finally, the first-order difference dynamic features, the second-order dynamic features and the static features are combined to obtain the 40-dimensional Mel-frequency cepstral coefficient emotion feature.
After the speech emotion features are obtained, the number of feature frames varies because the duration of the speech samples differs, which hinders recognition of the user's speech emotional state. Therefore, most of the features of the speech signal are preserved for further speech emotion recognition: each sample's feature data is uniformly clipped to 160 frames, samples shorter than 160 frames are zero-padded, samples longer than 160 frames are truncated, and the feature parameters are converted into speech emotion features in the form of an 80 × 80 matrix.
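The following librosa-based sketch illustrates this feature pipeline; it is not the patent's MATLAB implementation, and the 14 + 13 + 13 split used to reach 40 dimensions is an assumption, since the text only states that the static, first-order and second-order features are combined into a 40-dimensional feature.

import numpy as np
import librosa

def speech_emotion_feature(path, n_frames=160):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14)     # static cepstral features
    d1 = librosa.feature.delta(mfcc[:13], order=1)         # first-order dynamic features
    d2 = librosa.feature.delta(mfcc[:13], order=2)         # second-order dynamic features
    feat = np.vstack([mfcc, d1, d2]).T                     # (frames, 14 + 13 + 13 = 40)
    if feat.shape[0] < n_frames:                           # zero-pad samples shorter than 160 frames
        feat = np.pad(feat, ((0, n_frames - feat.shape[0]), (0, 0)))
    return feat[:n_frames].reshape(80, 80)                 # 160 x 40 values -> 80 x 80 matrix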
Step 3.3: performing emotion recognition by using the built transfer learning network;
the method comprises the steps of pre-training a used transfer learning network by utilizing an IEMOCAP database to generate a pre-training model, then finely adjusting parameters by utilizing a data set in an AVEC2013 database, and finally realizing the recognition of the speech emotion state of a user under the condition of rare data sets.
To improve the accuracy of continuous emotion recognition, as shown in fig. 4, the transfer learning neural network is composed of seven convolutional layers (Conv2D), a pooling layer (Pooling) and one fully connected layer, with a Dropout layer included to avoid overfitting; the activation functions are ReLU and Tanh. ReLU is used for the convolutional layers and Tanh is applied to the fully connected layer: because the continuous emotion labels lie between -1 and 1, Tanh is the more appropriate choice for the fully connected layer.
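A hedged Keras sketch of such a network is shown below; the filter counts, kernel sizes and exact layer arrangement are assumptions, and the description and claim 5 give slightly different numbers of convolutional layers.

import tensorflow as tf
from tensorflow.keras import layers

speech_model = tf.keras.Sequential([
    tf.keras.Input(shape=(80, 80, 1)),                     # 80 x 80 speech emotion feature matrix
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Dropout(0.5),                                   # avoid overfitting
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation="tanh"),                    # continuous labels lie in [-1, 1]
])
speech_model.compile(optimizer="adam", loss="mse")
# Transfer learning as described: pre-train on IEMOCAP, then fine-tune on AVEC2013.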
Step 4: fusing the emotion recognition results based on the face image data and the voice data;
and fusing the expression and voice emotion recognition results by utilizing multiple linear regression to realize multi-mode fused emotion recognition. The equation for multiple linear regression is shown below:
Y = β_0 + β_1·x_1 + β_2·x_2 + β_3·x_3 + … + β_k·x_k + ε   (6)

where β_0 is the regression constant; β_1, β_2, β_3, …, β_k are the regression coefficients; Y is the dependent variable; x_1, x_2, x_3, …, x_k are the independent variables; and ε is the random error.
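As a sketch of this decision-level fusion, the regression coefficients can be fitted with scikit-learn, the per-frame expression and speech predictions serving as the independent variables; the variable names below are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_fusion(expr_pred, speech_pred, y_true):
    # Learns beta_0 ... beta_k of Eq. (6) from the two single-modality predictions.
    X = np.column_stack([expr_pred, speech_pred])
    return LinearRegression().fit(X, y_true)

def fuse(model, expr_pred, speech_pred):
    # Fused multi-modal prediction for new frames.
    return model.predict(np.column_stack([expr_pred, speech_pred]))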
In order to better evaluate the accuracy of the user emotion state identification result, a consistency correlation coefficient is adopted to evaluate the identification result, and the mathematical formula of the consistency correlation coefficient is as follows:
ρ_c = 2·S_θθ′ / (S_θ² + S_θ′² + (μ_θ - μ_θ′)²)   (7)

where θ and θ′ are the true value and the predicted value respectively; S_θθ′ is the covariance of the true and predicted values; S_θ and S_θ′ are the standard deviations of the true and predicted values; and μ_θ and μ_θ′ are the corresponding mean values. The closer the value is to 1, the higher the accuracy of the recognition result. A comparison of the experimental results of the emotion recognition steps is shown in fig. 5.
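For reference, equation (7) can be computed with a small NumPy helper such as the one below (an illustration, not the patent's evaluation code).

import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))   # covariance of true and predicted values
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)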
Based on the above emotion recognition method, this embodiment provides a service robot task reasoning method. As shown in fig. 6, the method mainly uses uncertain information such as the user's emotional state, stress state, the time and the brightness to infer the service task of the service robot. Firstly, a MEBN (multi-entity Bayesian network) model of the service scene is constructed in advance, specifically as follows:
The user's emotional state and the environment information are analyzed, and entity fragments are constructed for the environment variables and the user emotion variables in the service scene; other variables related to a given variable are taken as context random-variable nodes in the corresponding entity fragment, and the entity fragments are integrated to obtain a situation-specific Bayesian network. In this embodiment, the environment variables include time, weather, brightness and the like, and the user emotion variables include the emotional state, the stress state and the like; the home dynamic scene model built with the multi-entity Bayesian network is shown in fig. 7. The associations among the environment variables, the user variables and the service content are also entered into the situation-specific Bayesian network in advance.
The initial probability distributions of the entity fragments are then set to obtain the multi-entity Bayesian model of the service scene. The initial probability of the user's emotional state is set to 90.0% happy, 0.0% sad, 0.0% calm and 10.0% angry. Once the entity fragments determine the probability distribution and the consistency constraint is satisfied, i.e. a unique joint probability distribution exists, the situation-specific Bayesian network is generated.
The inference method specifically comprises the following steps:
step 1: acquiring current environment state and user emotional state information;
step 2: retrieving and instantiating matched entity fragments according to the multi-entity Bayesian network model corresponding to the scene;
and step 3: carrying out recursion combination on the instantiated entity fragments, and carrying out consistency constraint inspection;
and 4, step 4: generating a descriptive Bayesian network after passing the inspection;
and 5: and realizing an optimal reasoning result according to a link tree reasoning algorithm.
To verify the effectiveness of the method, assume that at time t1 the most probable state of each influencing factor of the user is as follows: the emotional state is happy, the time is night, the light is bright and the stress is normal. The service inferred by the method is watching television, and the result is shown in fig. 8. At time t2 the user's emotional state changes together with the stress state: the probability that the emotional state is angry and the stress is light becomes 69.85%, while the other factors remain unchanged. The service task inferred by the reasoning algorithm of the invention is then playing music. The result is shown in fig. 9.
Comparing the reasoning results at the two moments shows that the service tasks inferred by the method accord with daily life: when the user's emotional state is happy, the robot provides the television-watching service so that the user can enjoy entertainment, and when the user is angry, the music-playing service is provided to help adjust the user's emotional state.
Example two
The embodiment aims to provide a multi-modal continuous emotion recognition system based on expressions and voice, which comprises:
a data acquisition module configured to acquire video data containing user facial expressions and voices;
the expression emotion recognition module is configured to extract a face image from the video image sequence and perform feature extraction on the face image to obtain expression emotion features; according to the expression emotion characteristics, continuous emotion recognition is carried out based on a pre-trained deep learning model;
the voice emotion recognition module is configured to acquire voice emotion characteristics by utilizing a Mel frequency cepstrum coefficient for voice data; according to the voice emotion characteristics, continuous emotion recognition is carried out on the basis of a pre-trained transfer learning network;
and the data fusion module is configured to fuse the expression emotion recognition result and the voice emotion recognition result to obtain a final recognition result.
On this basis, the embodiment further provides a robot service inference system, which includes the above multi-modal continuous emotion recognition system, and further includes:
the data acquisition module is configured to acquire current environment information and user emotion information; the user emotion information is acquired by adopting a multi-mode continuous emotion recognition system;
the service inference module is configured to infer a service task based on a pre-constructed multi-entity Bayesian network model; the multi-entity Bayesian network model comprises entity fragments corresponding to the environment variables and the user emotion variables, and relations among the environment variables, the user emotion variables and the service tasks.
EXAMPLE III
An object of the present embodiment is to provide a computer-readable storage medium.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a method according to one of the embodiments.
Example four
The purpose of this embodiment is to provide a service robot.
A service robot comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to one of the embodiments when executing the program.
The steps involved in the apparatuses of the above second, third and fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the relevant description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.
One or more of the above embodiments have the following technical effects:
real-time scene information in the smart space is input for robot service awareness, and user emotion information is a key input attribute in the real-time scene information. The emotional information is added, so that the robot adds emotional care on the basis of basic service. The service contents provided by the robot are different in the same scene and different emotional states.
The robot service task cognition combines emotional information, case reasoning and a user preference degree model. Taking the user emotional state as a core, and performing robot service autonomous cognition based on a case reasoning method to enable the robot to provide services with emotional temperature for the user; the user preference degree model is utilized to refine the service granularity and meet the personalized service requirements of the user; the closed-loop system based on the user emotion feedback evaluation mechanism realizes the correction of the service cognition result, enhances the attaching degree of the service cognition model and the user preference, reduces the manual participation and improves the intellectualization and emotion commercialization of the service robot service; after each service is finished, the robot self-learns the service knowledge and provides experience for subsequent service cognition.
Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (10)

1. A multi-modal continuous emotion recognition method based on expressions and voice is characterized by comprising the following steps:
acquiring video data containing facial expressions and voices of a user;
extracting a face image from the video image sequence, and extracting the characteristics of the face image to obtain expression and emotion characteristics;
according to the expression emotion characteristics, continuous emotion recognition is carried out based on a pre-trained deep learning model;
for voice data, acquiring voice emotion characteristics by utilizing a Mel frequency cepstrum coefficient;
according to the voice emotion characteristics, continuous emotion recognition is carried out on the basis of a pre-trained transfer learning network;
and fusing the expression emotion recognition result and the voice emotion recognition result to obtain a final recognition result.
2. The multi-modal continuous emotion recognition method based on expressions and voices as claimed in claim 1, wherein the expression emotion characteristics are obtained by performing feature extraction on the face image by using Gabor wavelet transform.
3. The multi-modal continuous emotion recognition method based on expressions and voices as claimed in claim 1, wherein extracting the face image comprises:
adopting a pre-trained neural network model to perform face recognition on a video image sequence, simultaneously recognizing abnormal video frames, and discarding the abnormal frames; wherein the neural network model cascades three convolutional neural networks of different depths.
4. The multi-modal continuous emotion recognition method based on expression and speech of claim 1, wherein the speech data is further preprocessed by:
processing voice data by utilizing a first-order non-recursive high-pass filter;
the speech data is subjected to framing processing, and a hamming window is added to realize smooth transition between two adjacent frames.
5. The multimodal continuous emotion recognition method based on expressions and voices as claimed in claim 1, wherein the transfer learning network comprises, in order from the input end to the output end: a first convolution layer, a pooling layer, a second convolution layer, a third convolution layer, a fourth convolution layer, a dropout layer and a fully connected layer, wherein the fully connected layer adopts a Tanh activation function.
6. A multi-modal continuous emotion recognition system based on expressions and speech, comprising:
a data acquisition module configured to acquire video data containing user facial expressions and voices;
the expression emotion recognition module is configured to extract a face image from the video image sequence and perform feature extraction on the face image to obtain expression emotion features; according to the expression emotion characteristics, continuous emotion recognition is carried out based on a pre-trained deep learning model;
the voice emotion recognition module is configured to acquire voice emotion characteristics by utilizing a Mel frequency cepstrum coefficient for voice data; according to the voice emotion characteristics, continuous emotion recognition is carried out on the basis of a pre-trained transfer learning network;
and the data fusion module is configured to fuse the expression emotion recognition result and the voice emotion recognition result to obtain a final recognition result.
7. A robot service inference method is characterized by comprising the following steps:
acquiring current environment information and user emotion information; wherein, the user emotion information is obtained by the method of any one of claims 1-5;
reasoning to obtain a service task based on a pre-constructed multi-entity Bayesian network model; the multi-entity Bayesian network model comprises entity fragments corresponding to the environment variables and the user emotion variables, and relations among the environment variables, the user emotion variables and the service tasks.
8. A robotic service inference system, comprising:
the data acquisition module is configured to acquire current environment information and user emotion information; wherein, the user emotion information is obtained by the method of any one of claims 1-5;
the service inference module is configured to infer a service task based on a pre-constructed multi-entity Bayesian network model; the multi-entity Bayesian network model comprises entity fragments corresponding to the environment variables and the user emotion variables, and relations among the environment variables, the user emotion variables and the service tasks.
9. A service robot, characterized by being configured to perform the multi-modal continuous emotion recognition method based on expressions and speech as claimed in any one of claims 1 to 5, or the robot service inference method as claimed in claim 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the multi-modal continuous emotion recognition method based on expressions and speech according to any of claims 1 to 5, or the robot service inference method according to claim 7.
CN202110361649.6A 2021-04-02 2021-04-02 Multi-mode continuous emotion recognition method, service inference method and system Active CN113033450B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110361649.6A CN113033450B (en) 2021-04-02 2021-04-02 Multi-mode continuous emotion recognition method, service inference method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110361649.6A CN113033450B (en) 2021-04-02 2021-04-02 Multi-mode continuous emotion recognition method, service inference method and system

Publications (2)

Publication Number Publication Date
CN113033450A (en) 2021-06-25
CN113033450B (en) 2022-06-24

Family

ID=76453612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110361649.6A Active CN113033450B (en) 2021-04-02 2021-04-02 Multi-mode continuous emotion recognition method, service inference method and system

Country Status (1)

Country Link
CN (1) CN113033450B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011551A (en) * 2021-04-02 2021-06-22 山东大学 Robot service cognition method and system based on user emotion feedback
CN113420556A (en) * 2021-07-23 2021-09-21 平安科技(深圳)有限公司 Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN113433874A (en) * 2021-07-21 2021-09-24 广东工业大学 5G-based unmanned ship comprehensive control management system and method
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium
CN115620268A (en) * 2022-12-20 2023-01-17 深圳市徐港电子有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN116071810A (en) * 2023-04-03 2023-05-05 中国科学技术大学 Micro expression detection method, system, equipment and storage medium
CN116682168A (en) * 2023-08-04 2023-09-01 阳光学院 Multi-modal expression recognition method, medium and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360457A (en) * 2011-10-20 2012-02-22 北京邮电大学 Bayesian network and ontology combined reasoning method capable of self-perfecting network structure
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN108334583A (en) * 2018-01-26 2018-07-27 上海智臻智能网络科技股份有限公司 Affective interaction method and device, computer readable storage medium, computer equipment
CN108664932A (en) * 2017-05-12 2018-10-16 华中师范大学 A kind of Latent abilities state identification method based on Multi-source Information Fusion
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609572B (en) * 2017-08-15 2021-04-02 中国科学院自动化研究所 Multi-modal emotion recognition method and system based on neural network and transfer learning
CN108960191B (en) * 2018-07-23 2021-12-14 厦门大学 Multi-mode fusion emotion calculation method and system for robot
CN111862984B (en) * 2019-05-17 2024-03-29 北京嘀嘀无限科技发展有限公司 Signal input method, device, electronic equipment and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360457A (en) * 2011-10-20 2012-02-22 北京邮电大学 Bayesian network and ontology combined reasoning method capable of self-perfecting network structure
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN108664932A (en) * 2017-05-12 2018-10-16 华中师范大学 A kind of Latent abilities state identification method based on Multi-source Information Fusion
CN108334583A (en) * 2018-01-26 2018-07-27 上海智臻智能网络科技股份有限公司 Affective interaction method and device, computer readable storage medium, computer equipment
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱晨岗: "Research on Emotion Recognition Technology Based on an Audio-Visual Perception System", China Excellent Master's Theses Full-text Database (Master), Information Science and Technology Series *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011551A (en) * 2021-04-02 2021-06-22 山东大学 Robot service cognition method and system based on user emotion feedback
CN113433874A (en) * 2021-07-21 2021-09-24 广东工业大学 5G-based unmanned ship comprehensive control management system and method
CN113420556A (en) * 2021-07-23 2021-09-21 平安科技(深圳)有限公司 Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN113420556B (en) * 2021-07-23 2023-06-20 平安科技(深圳)有限公司 Emotion recognition method, device, equipment and storage medium based on multi-mode signals
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion
CN115620268A (en) * 2022-12-20 2023-01-17 深圳市徐港电子有限公司 Multi-modal emotion recognition method and device, electronic equipment and storage medium
CN116071810A (en) * 2023-04-03 2023-05-05 中国科学技术大学 Micro expression detection method, system, equipment and storage medium
CN116682168A (en) * 2023-08-04 2023-09-01 阳光学院 Multi-modal expression recognition method, medium and system
CN116682168B (en) * 2023-08-04 2023-10-17 阳光学院 Multi-modal expression recognition method, medium and system

Also Published As

Publication number Publication date
CN113033450B (en) 2022-06-24

Similar Documents

Publication Publication Date Title
CN113033450B (en) Multi-mode continuous emotion recognition method, service inference method and system
DE112017003563B4 (en) METHOD AND SYSTEM OF AUTOMATIC LANGUAGE RECOGNITION USING POSTERIORI TRUST POINT NUMBERS
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN111930992B (en) Neural network training method and device and electronic equipment
DE102019122180A1 (en) METHOD AND SYSTEM FOR KEY EXPRESSION DETECTION BASED ON A NEURONAL NETWORK
CN111312245B (en) Voice response method, device and storage medium
DE112020002531T5 (en) EMOTION DETECTION USING SPEAKER BASELINE
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN114550057A (en) Video emotion recognition method based on multi-modal representation learning
CN115293132B Dialogue processing method and device for virtual scenes, electronic device, and storage medium
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN115359534B (en) Micro-expression identification method based on multi-feature fusion and double-flow network
Wehenkel et al. Diffusion priors in variational autoencoders
DE102022131824A1 (en) Visual speech recognition for digital videos using generative-adversative learning
Stappen et al. From speech to facial activity: towards cross-modal sequence-to-sequence attention networks
CN113139525A (en) Multi-source information fusion-based emotion recognition method and man-machine interaction system
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN115472182A (en) Attention feature fusion-based voice emotion recognition method and device of multi-channel self-encoder
CN115376214A (en) Emotion recognition method and device, electronic equipment and storage medium
CN110188367B (en) Data processing method and device
Sadok et al. A multimodal dynamical variational autoencoder for audiovisual speech representation learning
CN110958417A (en) Method for removing compression noise of video call video based on voice clue
CN117115312B (en) Voice-driven facial animation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant