CN110556129A - Bimodal emotion recognition model training method and bimodal emotion recognition method - Google Patents

Bimodal emotion recognition model training method and bimodal emotion recognition method

Info

Publication number
CN110556129A
Authority
CN
China
Prior art keywords
emotion recognition
training
voice
emotion
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910851155.9A
Other languages
Chinese (zh)
Other versions
CN110556129B (en)
Inventor
邹月娴
张钰莹
甘蕾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201910851155.9A priority Critical patent/CN110556129B/en
Publication of CN110556129A publication Critical patent/CN110556129A/en
Application granted granted Critical
Publication of CN110556129B publication Critical patent/CN110556129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a training method for a bimodal emotion recognition model and a bimodal emotion recognition method. The training method comprises the following steps: inputting voice training data into a first neural network model for training to obtain a speech emotion recognition model; inputting image training data into a second neural network model and performing first-stage supervised training with a first loss function to obtain a first-stage initial image emotion recognition model; inputting the image training data into the first-stage initial image emotion recognition model and performing second-stage supervised training with a second loss function to obtain a target image emotion recognition model; and performing decision-level fusion of the speech emotion recognition model and the target image emotion recognition model to obtain the bimodal emotion recognition model.

Description

Bimodal emotion recognition model training method and bimodal emotion recognition method
Technical Field
The application relates to the technical field of voice processing and image processing, in particular to a training method of a bimodal emotion recognition model and a bimodal emotion recognition method.
Background
Bimodal emotion recognition integrates multiple disciplines such as speech signal processing, digital image processing, pattern recognition and psychology, and is an important branch of human-computer interaction. It helps provide a better, more humanized user experience by enabling a robot to perceive and analyze a user's emotional state and generate a corresponding response; emotion recognition, as an important capability of robots, therefore has broad research and application prospects. However, the accuracy of existing emotion recognition is relatively low.
Disclosure of Invention
In view of this, embodiments of the present application aim to provide a training method for a bimodal emotion recognition model and a bimodal emotion recognition method, so that the emotional state of a user can be recognized more accurately.
In a first aspect, an embodiment of the present application provides a method for training a bimodal emotion recognition model, including:
Inputting voice training data into a first neural network model for training to obtain a voice emotion recognition model;
Inputting image training data into a second neural network model, and performing first-stage supervised training by adopting a first loss function to obtain a first-stage initial image emotion recognition model;
inputting the image training data into the initial image emotion recognition model of the first stage, and performing supervised training of a second stage by adopting a second loss function to obtain a target image emotion recognition model;
And performing decision-making level fusion on the voice emotion recognition model and the target image emotion recognition model to obtain a bimodal emotion recognition model.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation manner of the first aspect, where: the step of inputting the image training data into a second neural network model and adopting a first loss function to perform supervised training of a first stage so as to obtain an initial image emotion recognition model of the first stage comprises the following steps:
Inputting the image training data into the second neural network model, and performing first-stage supervised training by adopting a cross entropy loss function to obtain a first-stage initial image emotion recognition model;
The step of inputting the image training data into the initial image emotion recognition model of the first stage and adopting a second loss function to perform supervised training of the second stage to obtain a target image emotion recognition model comprises the following steps:
And inputting the image training data into the initial image emotion recognition model in the first stage, and performing supervised training in the second stage by adopting a focus loss function to obtain a target image emotion recognition model.
According to the image emotion recognition model training method provided by the embodiment of the application, because hard-to-classify facial expression samples exist, directly training the network with a cross entropy loss function cannot effectively solve the expression misclassification problem; therefore, a two-stage training strategy is used in facial expression recognition to extract more discriminative facial expression features. Specifically, in the first stage of image emotion recognition model training, supervised training with a cross entropy loss function gives the trained model a preliminary discrimination capability; in the second stage, supervised training with a focus loss function enables the trained model to finely distinguish confusable features, so that the recognition accuracy of the trained model is relatively higher.
With reference to the first aspect or the first possible implementation manner of the first aspect, an embodiment of the present application provides a second possible implementation manner of the first aspect, where the second neural network model includes an activation function comprising a linear rectification function and an optimizer comprising a stochastic gradient descent algorithm.
According to the image emotion recognition model training method provided by the embodiment of the application, the linear rectification function better adapts to the diverse requirements of the image emotion recognition model, and the stochastic gradient descent optimizer achieves relatively fast convergence, improving the training speed of the image emotion recognition model.
with reference to the first aspect, an embodiment of the present application provides a third possible implementation manner of the first aspect, where the step of inputting speech training data into the first neural network model for training to obtain a speech emotion recognition model includes:
And inputting the voice training data into a first neural network model, and performing supervised training by adopting a combined loss function consisting of an affinity loss function and a focus loss function to obtain a voice emotion recognition model.
According to the speech emotion recognition model training method provided by the embodiment of the application, supervised training of the first neural network model with the affinity loss enables the trained model to better recognize features in speech. To address the problems of easily confusable emotions and imbalanced emotion classes, the speech emotion recognition model draws on the idea of metric learning and uses the joint loss of the Affinity loss and the Focal loss as the loss function. Compared with existing methods that use only cross entropy as the loss function, the joint loss function provided by the embodiment of the application improves feature discriminability and alleviates the class imbalance of the emotion data.
With reference to the first aspect or the third possible implementation manner of the first aspect, an embodiment of the present application provides a fourth possible implementation manner of the first aspect, where the first neural network model includes an input layer, a hidden layer, an output layer and an optimizer, wherein the hidden layer and the output layer include activation functions, the activation function of the hidden layer includes a max-feature-map function, the activation function of the output layer includes a softmax function, and the optimizer includes an RMSProp optimizer.
The bimodal emotion recognition model training method provided by the embodiment of the application can also establish the initial model of the speech emotion recognition model through the structure of the first neural network model.
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation manner of the first aspect, where the method further includes:
And constructing a training database comprising the voice training data and the image training data.
With reference to the fifth possible implementation manner of the first aspect, this application provides a sixth possible implementation manner of the first aspect, where the step of constructing a training database including the speech training data and the image training data includes:
Recording voice in a target environment by using an acoustic vector sensor, and encoding the acquired voice signal with pulse code modulation of a specified bit depth to obtain an initial voice data set;
Pre-processing the initial speech data set, the pre-processing comprising one or more of: selecting complete-sentence voice data from the initial voice data set, removing noise from the voice data in the initial voice data set, and removing silence from the voice data in the initial voice data set;
Naming the preprocessed initial voice data according to a first set naming rule to obtain a voice training data set, wherein the voice training data is data in the voice training data set;
Recording the video in the target environment to obtain initial video data;
Correspondingly cutting the initial video data according to the voice data in the voice training data set to obtain a video training data set, wherein the image training data is one or more frames of images in the video data in the video training data set, and the training database comprises the voice training data set and the video training data set.
The bimodal emotion recognition model training method provided by the embodiment of the application can also construct the voice training data and the video training data from data collected on site, which better represents the user's expressions, so that the training data can better train the model.
In a second aspect, an embodiment of the present application further provides a bimodal emotion recognition method, including:
acquiring voice data generated by a target user in a target time period;
acquiring video data of the target user in the target time period;
Recognizing the voice data by using the voice emotion recognition model in the first aspect or any one of the possible implementation manners of the first aspect to obtain a first emotion recognition result;
Performing emotion recognition on each picture in the video data by using the first aspect or the image emotion recognition model in any one of the possible implementation manners of the first aspect to obtain an image emotion recognition result of each picture;
Determining a second emotion recognition result according to the image emotion recognition result of each image;
and determining the emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result.
According to the bimodal emotion recognition method provided by the embodiment of the application, combining voice and image in a bimodal manner better represents the emotion of the user. Further, for the multiple pictures in a video, fusing the recognition results of those pictures better represents the user's facial expression and thus expresses the user's emotion more vividly.
In combination with the second aspect, the present embodiments provide a first possible implementation manner of the second aspect, where: the first emotion recognition result is a first probability matrix formed by probability values corresponding to all emotion classifications, and the second emotion recognition result is a second probability matrix formed by probability values corresponding to all emotion classifications; the step of determining the emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result comprises the following steps:
Carrying out weighted summation on the first probability matrix and the second probability matrix to determine an emotion probability matrix of the target user;
and determining the current emotion type of the target user according to the emotion probability matrix.
The bimodal emotion recognition method provided by the embodiment of the application can also perform weighted summation on the first probability matrix and the second probability matrix, which effectively balances the relative importance of the two matrices in realizing emotion recognition.
In combination with the second aspect, the present embodiments provide a second possible implementation manner of the second aspect, where: the step of performing weighted summation on the first probability matrix and the second probability matrix to determine the emotion probability matrix of the target user includes:
weighting the first probability matrix with a first weight and the second probability matrix with a second weight;
And summing the weighted first probability matrix and the weighted second probability matrix to obtain the emotion probability matrix of the target user, wherein the first weight is equal to the second weight.
According to the bimodal emotion recognition method provided by the embodiment of the application, setting the first weight equal to the second weight balances the importance of voice and facial expression, so that the emotion of the user is expressed in a balanced manner.
In a third aspect, an embodiment of the present application further provides a device for training a bimodal emotion recognition model, including:
The first training module is used for inputting voice training data into the first neural network model for training so as to obtain a voice emotion recognition model;
The second training module is used for inputting the image training data into a second neural network model and performing supervised training of the first stage by adopting a first loss function so as to obtain an initial image emotion recognition model of the first stage;
And the third training module is used for inputting the image training data into the initial image emotion recognition model in the first stage, adopting a second loss function to perform supervised training in the second stage so as to obtain a target image emotion recognition model, and performing decision-making level fusion on the voice emotion recognition model and the target image emotion recognition model so as to obtain a bimodal emotion recognition model.
In a fourth aspect, an embodiment of the present application further provides a bimodal emotion recognition apparatus, including:
the first acquisition module is used for acquiring voice data generated by a target user in a target time period;
The second acquisition module is used for acquiring the video data of the target user in the target time period;
A first recognition module, configured to use the speech emotion recognition model in the first aspect or any one of the possible implementation manners of the first aspect to recognize the speech data, so as to obtain a first emotion recognition result;
A second emotion recognition module, configured to recognize each picture in the video data by using the target image emotion recognition model in the first aspect or any one of possible implementation manners of the first aspect, and obtain an image emotion recognition result of each picture;
the first determining module is used for determining a second emotion recognition result according to the image emotion recognition result of each image;
and the second determining module is used for determining the emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory storing machine-readable instructions executable by the processor; when the electronic device runs, the machine-readable instructions, when executed by the processor, perform the steps of the method in the first aspect described above or any possible implementation of the first aspect.
In a sixth aspect, this embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the method in the first aspect, or any one of the possible implementations of the first aspect, or the second aspect, or any one of the possible implementations of the second aspect.
The bimodal emotion recognition model training method, the bimodal emotion recognition method, the devices, the electronic equipment and the computer-readable storage medium provided by the embodiments of the application jointly use a speech emotion recognition model and a target image emotion recognition model to form the bimodal emotion recognition model. Further, the training process of the target image emotion recognition model comprises two stages: the first stage may be called the 'separable features' stage, and the second stage the 'more discriminative features' stage. This two-stage model training makes the emotion recognition accuracy of the bimodal emotion recognition model relatively higher.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure.
FIG. 2 is a flowchart of a bimodal emotion recognition model training method provided in an embodiment of the present application.
FIG. 3 is a partial flowchart of a bimodal emotion recognition model training method provided in an embodiment of the present application.
FIG. 4 is a functional block diagram of a bimodal emotion recognition model training apparatus provided in an embodiment of the present application.
FIG. 5 is a flowchart of a bimodal emotion recognition model training method provided in an embodiment of the present application.
FIG. 6 is a detailed flowchart of step 406 of a bimodal emotion recognition model training method provided in an embodiment of the present application.
FIG. 7 is a functional block diagram of a bimodal emotion recognition model training apparatus according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.
With the development of neural network technology, human expressions can also be recognized by machines. Human emotions can be expressed through expressions, and the emotions of corresponding users need to be concerned in a plurality of fields, so that emotion recognition is also applied to various industries. For example, emotion recognition can be used in distance education, the emotional state of a learner is detected and fed back to a teacher in real time, and the teacher can find the situation in time and make corresponding adjustment, so that the teaching quality is improved. For another example, emotion recognition can also be used in an on-vehicle system, and emotion recognition is used for monitoring the emotion of a car driver, so that the driver in a fatigue state or with strong emotion changes is reminded or pacified, and the possibility of traffic accidents is reduced. For another example, emotion recognition can also be used for a family service robot, and the robot receives emotion information, performs calculation and analysis and makes corresponding emotion feedback, so that a consumer has better user experience. For another example, emotion recognition can also be used in clinical medicine, and emotion changes of depression patients or autistic children can be tracked by means of emotion recognition as a basis for disease diagnosis and treatment.
Existing emotion recognition is mainly based on single-modality emotion recognition (speech emotion recognition or image emotion recognition). However, the emotion feature information used in single-modality emotion recognition is relatively limited, so such recognition has certain limitations in accuracy and comprehensiveness of expression. For example, when people are pleased, besides the raised corners of the mouth and relaxed facial muscles, the pitch of speech rises slightly and the timbre becomes lighter; therefore, the emotional information conveyed by a single modality lacks completeness. Limitations of speech emotion recognition include: noise in the environment; differences between the voices of different speakers; for a word or sentence with multiple meanings, if intonation and speaking rate show no obvious change, the speaker's emotional state at that moment is difficult to judge; and the lack of large-scale training data. All of these limitations result in relatively low speech emotion recognition accuracy. Limitations of facial expression recognition include: for people with relatively stiff faces, facial expression changes are small and of limited amplitude; occlusion and illumination changes may mislead the recognition method, which is very disadvantageous for emotion recognition based on facial expressions. Because of these limitations, single-modality emotion recognition tends to be less accurate.
According to the inventors' research, emotion is not expressed only in forms that can be captured as images; its expression takes various forms, including multiple modalities such as voice, facial expression, body language and gesture. Research shows that when humans express emotion, the information conveyed by voice and facial expression accounts for more than 93%, so voice and the face are the main modalities and carriers of human emotional expression. Language can express complex human emotions, and the tone of speech (such as its intensity and speed) can express a person's emotional state more vividly and completely. The face is the most effective expressive organ, and its components form an organic whole that acts in coordination to accurately express the same emotion.
Based on the above research, the embodiments of the application provide a training method for a bimodal emotion recognition model and a bimodal emotion recognition method, which perform recognition by combining multiple kinds of information output by humans, thereby improving the accuracy of human emotion recognition.
example one
To facilitate understanding of the embodiment, an electronic device for executing the bimodal emotion recognition model training method or the bimodal emotion recognition method disclosed in the embodiments of the present application will first be described in detail.
Fig. 1 is a schematic block diagram of the electronic device. The electronic device 100 may include a memory 111, a memory controller 112, a processor 113, a peripheral interface 114, an input-output unit 115, and a display unit 116. It will be understood by those of ordinary skill in the art that the structure shown in Fig. 1 is merely exemplary and is not intended to limit the structure of the electronic device 100. For example, electronic device 100 may also include more or fewer components than shown in Fig. 1, or have a different configuration than shown in Fig. 1.
The above-mentioned elements of the memory 111, the memory controller 112, the processor 113, the peripheral interface 114, the input/output unit 115 and the display unit 116 are electrically connected to each other directly or indirectly, so as to implement data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The processor 113 is used to execute the executable modules stored in the memory.
The Memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 is configured to store a program, and the processor 113 executes the program after receiving an execution instruction. The method executed by the electronic device 100 defined by the processes disclosed in any embodiment of the present application may be applied to the processor 113 or implemented by the processor 113.
The processor 113 may be an integrated circuit chip having signal processing capability. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed by it. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The peripheral interface 114 couples various input/output devices to the processor 113 and memory 111. In some embodiments, the peripheral interface 114, the processor 113, and the memory controller 112 may be implemented in a single chip. In other embodiments, they may be implemented by separate chips.
The input/output unit 115 is used for the user to input data. The input/output unit 115 may be, but is not limited to, a mouse, a keyboard, and the like.
The display unit provides an interactive interface (e.g., a user interface) between the electronic device 100 and a user, or is used to display image data for the user's reference. In this embodiment, the display unit may be a liquid crystal display or a touch display. In the case of a touch display, it can be a capacitive or resistive touch screen that supports single-point and multi-point touch operations; that is, the touch display can sense touch operations generated simultaneously at one or more positions on the display and pass the sensed touch operations to the processor for calculation and processing.
the electronic device 100 in this embodiment may be configured to perform each step in each method provided in this embodiment. The following describes in detail the implementation process of the bimodal emotion recognition model training method and the bimodal emotion recognition method by several embodiments.
Example two
Please refer to fig. 2, which is a flowchart of a bimodal emotion recognition model training method provided in an embodiment of the present application. The specific process shown in fig. 2 will be described in detail below.
step 201, inputting voice training data into a first neural network model for training to obtain a voice emotion recognition model.
Optionally, step 201 may include: inputting voice training data into a first neural network model, and performing supervised training by adopting a combined loss function consisting of an Affinity loss function (Affinity loss) and a Focal loss function (Focal loss) to obtain a voice emotion recognition model.
To address the problems of easily confusable emotions and imbalanced emotion classes, the speech emotion recognition model draws on the idea of metric learning and uses the joint loss of the Affinity loss and the Focal loss as the loss function. Compared with existing methods that use only cross entropy as the loss function, the joint loss function provided by the embodiment of the application improves the discriminability of features in speech and alleviates the class imbalance of the emotion data.
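By way of illustration only, the following sketch shows one possible form of such a joint loss in PyTorch. The focal loss follows the standard formulation; the affinity term is a simplified, assumed variant based on learnable class centers and a Gaussian similarity, and the exact formulation used in this embodiment may differ. The weight balancing the two terms is likewise an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Focal loss over class logits: FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits, targets):
        log_p = F.log_softmax(logits, dim=-1)
        ce = F.nll_loss(log_p, targets, reduction="none")   # per-sample -log(p_t)
        p_t = torch.exp(-ce)
        return ((1.0 - p_t) ** self.gamma * ce).mean()

class AffinityLoss(nn.Module):
    """Simplified affinity term (assumed): pull each feature toward a learnable
    center of its own class via a Gaussian similarity with a fixed margin."""
    def __init__(self, feat_dim, num_classes, sigma=10.0, margin=0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.sigma, self.margin = sigma, margin

    def forward(self, features, targets):
        dist = torch.cdist(features, self.centers) ** 2            # (B, C) squared distances
        sim = torch.exp(-dist / self.sigma)                        # Gaussian similarity
        pos = sim.gather(1, targets.unsqueeze(1)).squeeze(1)       # similarity to own class
        return F.relu(self.margin - pos).mean()                    # hinge on the margin

# Joint objective (lambda is a tunable weight, assumed here):
# loss = FocalLoss()(logits, targets) + lam * AffinityLoss(64, 3)(features, targets)
```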
In this embodiment, the first neural network model may be a Deep Neural Network (DNN) model including an input layer, a hidden layer, and an output layer. The input to the DNN is a sentence-level feature vector, and the learning target is the emotion label.
Illustratively, the DNN model described above may include one input layer, four hidden layers, and an output layer. For example, the input layer may have a first set number of neurons, which may be equal to the dimensionality of the sentence-level feature vectors. The first hidden layer has 512 neurons, the second hidden layer has 256 neurons, the third hidden layer has 128 neurons, the fourth hidden layer has 64 neurons, and the output layer has a second set number of neurons, which may be equal to the number of emotion classes.
Illustratively, neurons in the hidden layer may employ Max Feature Map (MFM) as an activation function. The neuron activation function of the output layer may be softmax.
In one example, the initial learning rate may be set to 1e-3, the learning rate is halved every 10 iterations, and the batch size is thirty-two.
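An illustrative sketch of such a DNN is given below (PyTorch), under the assumption that each MFM hidden layer is realized as a linear layer with twice the stated number of units followed by an element-wise max over its two halves; the feature dimensionality (384) and class count (3) follow the examples elsewhere in this description.

```python
import torch
import torch.nn as nn

class MFMLinear(nn.Module):
    """Max-Feature-Map: a linear layer with 2*out_dim units whose activation
    takes the element-wise max of the two halves (a competitive alternative to ReLU)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * out_dim)

    def forward(self, x):
        a, b = self.fc(x).chunk(2, dim=-1)
        return torch.max(a, b)

class SpeechEmotionDNN(nn.Module):
    def __init__(self, feat_dim=384, num_classes=3):
        super().__init__()
        self.hidden = nn.Sequential(
            MFMLinear(feat_dim, 512),
            MFMLinear(512, 256),
            MFMLinear(256, 128),
            MFMLinear(128, 64),
        )
        self.out = nn.Linear(64, num_classes)   # softmax is applied by the loss / at inference

    def forward(self, x):
        return self.out(self.hidden(x))

model = SpeechEmotionDNN()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve every 10 iterations
```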
In one example, Low Level Descriptors (LLDs) may be extracted from the original audio frame by frame using a window of length 25 ms with a sliding step of 10 ms, and statistics of the LLDs may be computed to extract a 384-dimensional sentence-level (utterance-level) feature vector from the original audio. The speech emotion feature set covers prosodic, spectral and voice-quality characteristics. The sixteen selected low-order descriptors are: zero-crossing rate (ZCR), root mean square energy (RMS Energy), pitch frequency (denoted by F0 in Table 1 below), harmonics-to-noise ratio (HNR), and Mel-frequency cepstral coefficients 1-12 (MFCC 1-12). The twelve functionals employed are: mean, standard deviation, kurtosis, skewness, the maximum and minimum values with their relative positions and range, and two linear regression coefficients together with their mean square error (MSE). Each low-order descriptor and its first-order difference are summarized by the 12 functionals, so the final feature contains 16 × 2 × 12 = 384 parameters.
the set of speech emotion features used can be as shown in table 1 below.
TABLE 1
Alternatively, the above feature vectors are normalized so that the vectors representing sentences are limited to a set range, eliminating order-of-magnitude differences between dimensions and making it easier for the network to converge. Illustratively, the normalization method may be Z-score normalization, decimal scaling, or the like.
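A minimal sketch of the sentence-level feature computation and Z-score normalization is shown below, assuming the twelve functionals listed above are applied to the 16 LLD contours and their first-order differences; the exact functional definitions in the toolchain actually used may differ slightly.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def functionals(contour):
    """Twelve statistics computed over one LLD contour (one value per frame)."""
    t = np.arange(len(contour))
    slope, offset = np.polyfit(t, contour, 1)              # linear regression coefficients
    mse = np.mean((slope * t + offset - contour) ** 2)     # regression mean square error
    i_max, i_min = np.argmax(contour), np.argmin(contour)
    return np.array([
        contour.mean(), contour.std(), kurtosis(contour), skew(contour),
        contour.max(), contour.min(),
        i_max / len(contour), i_min / len(contour),         # relative positions of extrema
        contour.max() - contour.min(),                      # range
        slope, offset, mse,
    ])

def utterance_features(llds):
    """llds: (num_frames, 16) low-level descriptors of one utterance.
    Functionals are applied to the 16 LLDs and their frame-wise deltas:
    (16 + 16) contours x 12 functionals = 384 dimensions."""
    deltas = np.diff(llds, axis=0, prepend=llds[:1])
    contours = np.concatenate([llds, deltas], axis=1)       # (num_frames, 32)
    return np.concatenate([functionals(contours[:, i]) for i in range(contours.shape[1])])

# Z-score normalization over the training set, X: (num_utterances, 384)
# X_norm = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
```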
Illustratively, the emotion categories can be divided into three classes: positive, neutral and negative.
step 202, inputting the image training data into a second neural network model, and performing supervised training of the first stage by using a first loss function to obtain an initial image emotion recognition model of the first stage.
Step 202 may include: and inputting the image training data into the second neural network model, and performing supervised training in the first stage by adopting a Cross-entropy Loss function (Cross-entropy Loss) to obtain an initial image emotion recognition model in the first stage.
Optionally, before inputting the image training data into the second neural network model, a face detector may be used for face detection. The detected face region image is then obtained by cropping. Furthermore, the face region images can be normalized to a size of 299 × 299. The normalized face region images may be used as the input data of the second neural network model.
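The following sketch illustrates this preprocessing with OpenCV; the Haar-cascade detector is only an assumed stand-in, since the embodiment does not name a specific face detector.

```python
import cv2

# Assumed detector: OpenCV's bundled Haar cascade; any face detector could be substituted.
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face(frame_bgr, size=299):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detected face
    face = frame_bgr[y:y + h, x:x + w]
    return cv2.resize(face, (size, size))                # normalized to 299 x 299
```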
Illustratively, the second neural network model may include an activation function including a linear rectification function, an optimizer including a stochastic gradient descent algorithm, and a loss function including a cross entropy loss function employed in the first model training stage and a focus loss function employed in the second model training stage.
Step 203, inputting the image training data into the first-stage initial image emotion recognition model, performing second-stage supervised training with a second loss function to obtain a target image emotion recognition model, and performing decision-level fusion on the speech emotion recognition model and the target image emotion recognition model to obtain a bimodal emotion recognition model.
Illustratively, in both stages, ReLU (Rectified Linear Unit) is used as the activation function, the initial learning rate is set to 1e-2, the batch size is thirty-two, and SGD (Stochastic Gradient Descent) is used as the optimizer.
step 203 may comprise: inputting the image training data into the initial image emotion recognition model of the first stage, and performing supervised training of the second stage by adopting a Focal loss function (Focal loss) to obtain a target image emotion recognition model.
Illustratively, the second neural network model may be a Deep Convolutional Neural Network (DCNN) model: deep facial emotion features are extracted based on a deep convolutional neural network Inception model. The DCNN model may comprise forty-seven convolutional layers and fully extracts deep features using a parallel convolution-pooling structure.
In one example, in the first stage, under the supervision of the Cross-entropy Loss, 2048-dimensional deep facial emotion features can be extracted for each input face region image based on the Inception model in the second neural network model.
In the second stage, under the supervision of the Focal loss, the network model obtained in the previous stage continues to extract 2048-dimensional deep facial emotion features for each input face region image; the resulting facial emotion features are more discriminative, and the final video emotion recognition result is obtained through a fully connected layer.
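A condensed sketch of the two-stage training is given below, assuming torchvision's Inception v3 as the backbone (the forty-seven-layer Inception variant described above may differ) and reusing the FocalLoss sketched earlier; hyper-parameters follow the values stated in this example.

```python
import torch
import torch.nn as nn
from torchvision import models

# Assumed backbone: Inception v3 (299 x 299 input, 2048-d pooled features before the classifier).
backbone = models.inception_v3(weights=None, aux_logits=False)
backbone.fc = nn.Linear(2048, 3)           # 3 emotion classes: positive / neutral / negative

def train_stage(model, loader, criterion, epochs, lr=1e-2):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:       # images: (32, 3, 299, 299) batches
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

# Stage 1 ("separable features"): supervised by cross entropy.
# train_stage(backbone, train_loader, nn.CrossEntropyLoss(), epochs=stage1_epochs)
# Stage 2 ("more discriminative features"): continue from stage-1 weights under focal-loss supervision.
# train_stage(backbone, train_loader, FocalLoss(gamma=2.0), epochs=stage2_epochs)
```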
Research shows that children and adults express emotion differently, so targeted research is needed. A survey of published results shows that research on speech emotion recognition for children is scarce and its recognition accuracy is not high, mainly for two reasons: on the one hand, children's speech emotion databases are small and their sample coverage is insufficient, so the advantages of machine learning cannot be fully exploited; on the other hand, emotion recognition models are insufficiently studied and generally use information from only a single modality. As the pace of life accelerates, the time and opportunities for emotional communication between parents and children decrease, yet children's physical and mental health cannot develop without sufficient emotional communication and companionship. A companion robot with speech emotion recognition can effectively fill this gap: it interacts closely with children and can feed the perceived emotional state of a child back to the parents in time, and therefore has substantial practical value. Based on this research, the emotion recognition model training method provided by the embodiment of the application can also establish corresponding databases for different groups of people, so that the emotion recognition model trained in the embodiment of the application can be used for different groups of people.
Therefore, on the basis shown in fig. 2, the method for emotion recognition model training in the embodiment of the present application may further construct a training database including the speech training data and the image training data.
Alternatively, as shown in fig. 3, the step of constructing the training database including the voice training data and the image training data may include the following steps.
Step 204, recording the voice in the target environment by using an acoustic vector sensor, and encoding the acquired voice signal with pulse code modulation of a specified bit depth to obtain an initial voice data set.
Alternatively, speech acquisition may be based on a novel microphone array, the Acoustic Vector Sensor (AVS).
Illustratively, the speech is recorded using an Acoustic Vector Sensor (AVS). In one example, the speech signal has a sampling rate of 48 kHz and is encoded using 16-bit PCM. The database established in the embodiment of the present application may include three types of emotion: audio and video data with positive, neutral and negative emotions.
alternatively, audio data for three types of emotions may be collected relatively uniformly. In one example, the total duration of the audio/video data may be 8 hours and 45 minutes, wherein the duration of the positive audio/video data is 2 hours and 16 minutes, the duration of the neutral audio/video data is 3 hours and 2 minutes, and the duration of the negative audio/video data is 3 hours and 27 minutes.
Step 205, preprocessing the initial voice data set.
The pre-processing comprises one or more of: selecting complete-sentence speech data from the initial speech data set, removing noise from the speech data in the initial speech data set, and removing silence from the speech data in the initial speech data set.
Alternatively, a Voice Activity Detection (VAD) method may be employed to select a complete sentence and obtain the corresponding audio and video data.
Alternatively, Low Level Descriptors (LLDs) may be extracted from the original audio by frames, and statistics thereof may be calculated, extracting feature vectors at sentence Level (Utterance Level).
Alternatively, Voice Activity Detection (VAD) may be used to perform voice detection, select a complete sentence, and remove noise and silence segments to obtain corresponding audio and video data.
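For illustration only, a toy energy-based VAD is sketched below; the embodiment does not specify the VAD algorithm, so both the method and its threshold are assumptions.

```python
import numpy as np

def energy_vad(signal, sr=48000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Very simple energy-based VAD: a frame counts as speech when its RMS energy
    (relative to the utterance peak) exceeds a threshold; silence frames can then
    be dropped and contiguous speech regions kept as complete sentences."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    peak = np.max(np.abs(signal)) + 1e-9
    flags = []
    for start in range(0, len(signal) - frame + 1, hop):
        rms = np.sqrt(np.mean(signal[start:start + frame] ** 2))
        flags.append(20 * np.log10(rms / peak + 1e-12) > threshold_db)
    return np.array(flags)   # True = speech frame, False = silence to remove
```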
And step 206, naming the preprocessed initial voice data according to a first set naming rule to obtain a voice training data set, wherein the voice training data is data in the voice training data set.
In the above example of audio and video acquisition with a total duration of 8 hours and 45 minutes, there are 12911 segments of audio/video (audio and video are in one-to-one correspondence), which may be named in a consistent way, for example <audio data, tag> and <video data, tag>. Of these, positive: 3459 sentences, neutral: 4087 sentences, negative: 5365 sentences.
Alternatively, the audio data and video data may be divided into a training set and a test set, where the training set has 24 people and 10632 segments of speech/video (positive: 2640, neutral: 3389, negative: 4603), and the test set has 6 people and 2279 segments of speech/video (positive: 819, neutral: 698, negative: 762).
for speech emotion recognition, the training data used in the speech emotion recognition model training phase is the < audio data, tag > data pair of known emotion class tags. Only < Audio data > is needed during the speech emotion recognition model test phase, and < Audio data, tag > data pairs can be used to verify model performance.
And step 207, recording the video in the target environment to obtain initial video data.
Video is acquired by a Kinect-based video data acquisition subsystem. The Kinect records the video scene at a frame rate of 15 fps with a picture size of 720p.
For facial emotion recognition, the training data used in the training phase of the target image emotion recognition model are <image data, tag> data pairs with known emotion class tags. The test phase of the target image emotion recognition model only needs <image data>.
and 208, correspondingly cutting the initial video data and the voice data in the voice training data set to obtain a video training data set.
The image training data is one or more frames of images in the video data in the video training data set, and the training database comprises the voice training data set and the video training data set.
In one example, if the trained emotion recognition model is used to recognize children's emotions, the training database may include 30 pupils aged 8 to 10 in a 1:1 ratio, recorded in an office environment, with a signal sampling rate of 44.1 kHz and 16-bit PCM encoding; the total data duration is 8 hours and 45 minutes, comprising 12911 sentences.
It will be appreciated that if the emotion recognition model is used to recognize the emotion of users of other ages, users of other ages may be selected to construct training data.
The method in the embodiment of the application jointly uses the speech emotion recognition model and the target image emotion recognition model to form the bimodal emotion recognition model. Further, the training process of the target image emotion recognition model comprises two stages: the first stage may be called the 'separable features' stage and the second stage the 'more discriminative features' stage. This two-stage model training makes the emotion recognition accuracy of the emotion recognition model relatively higher.
further, training databases of different ages can be adaptively constructed, so that the emotion recognition model in the embodiment of the application has higher flexibility.
EXAMPLE III
Based on the same application concept, an emotion recognition model training device corresponding to the emotion recognition model training method is further provided in the embodiment of the application, and as the principle of solving the problem of the device in the embodiment of the application is similar to that of the emotion recognition model training method in the embodiment of the application, the implementation of the device can refer to the implementation of the method, and repeated details are omitted.
Please refer to fig. 4, which is a functional block diagram of an emotion recognition model training apparatus according to an embodiment of the present application. Each module in the emotion recognition model training apparatus in this embodiment is used to execute each step in the above method embodiments. The emotion recognition model training device includes: a first training module 301, a second training module 302, and a third training module 303; wherein,
The first training module 301 is configured to input voice training data into a first neural network model for training to obtain a voice emotion recognition model;
The second training module 302 is configured to input the image training data into a second neural network model, and perform supervised training in the first stage by using the first loss function to obtain an initial image emotion recognition model in the first stage;
the third training module 303 is configured to input the image training data into the initial image emotion recognition model in the first stage, perform supervised training in the second stage by using a second loss function to obtain a target image emotion recognition model, and perform decision-level fusion on the speech emotion recognition model and the target image emotion recognition model to obtain a bimodal emotion recognition model.
in a possible implementation manner, the second training module 302 is further configured to:
Inputting the image training data into the second neural network model, and performing first-stage supervised training by adopting a cross entropy loss function to obtain a first-stage initial image emotion recognition model;
The third training module 303 is further configured to:
And inputting the image training data into the initial image emotion recognition model in the first stage, and performing supervised training in the second stage by adopting a focus loss function to obtain a target image emotion recognition model.
in one possible embodiment, the second neural network model comprises an activation function comprising a linear rectification function, an optimizer comprising a stochastic gradient descent algorithm.
in a possible implementation, the first training module 301 is further configured to:
And inputting the voice training data into a first neural network model, and performing supervised training by adopting a combined loss function consisting of an affinity loss function and a focus loss function to obtain a voice emotion recognition model.
In one possible implementation, the first neural network model includes an input layer, a hidden layer, an output layer, and an optimizer, wherein the hidden layer and the output layer include activation functions, the activation functions of the hidden layer include a maximum feature mapping function, the activation functions of the output layer include a softmax function, and the optimizer includes a RMSProp function.
In a possible implementation manner, in this embodiment, the emotion recognition model training apparatus further includes: a construction module 304, configured to construct a training database including the voice training data and the image training data.
In one possible implementation, the building module 304 is further configured to:
recording voice in a target environment by using an acoustic vector sensor, and coding the acquired voice signal by using pulse code modulation of specified positions to obtain an initial voice data set;
Pre-processing the initial speech data set, the pre-processing comprising: selecting one or more of complete sentence voice data in the initial voice data set, removing noise of the voice data in the initial voice data set, and removing silence data in the voice data in the initial voice data set;
Naming the preprocessed initial voice data according to a first set naming rule to obtain a voice training data set, wherein the voice training data is data in the voice training data set;
Recording the video in the target environment to obtain initial video data;
Correspondingly cutting the initial video data and the voice data in the voice training data set to obtain a video training data set, wherein the image training data is one or more frames of images in the video data in the video training data set, and the training data set comprises the voice training data set and the video training data set.
Example four
Please refer to fig. 5, which is a flowchart illustrating an emotion recognition method according to an embodiment of the present application. The specific flow shown in fig. 5 will be described in detail below.
step 401, acquiring voice data generated by a target user in a target time period.
The above-mentioned target period may be a period that needs to be identified. The target user may be a student in distance education, a driver or a passenger using a vehicle-mounted system, a user using a home service robot, a depression patient in clinical medicine, an autistic child, or the like.
Alternatively, the above-described voice data may be acquired using a novel microphone array, the Acoustic Vector Sensor (AVS).
Step 402, acquiring video data of the target user in the target time period.
Alternatively, the video data described above may be captured using Kinect.
And 403, recognizing the voice data by using a voice emotion recognition model to obtain a first emotion recognition result.
the speech emotion recognition model in this embodiment may be the speech emotion recognition model obtained by training in the second embodiment.
And step 404, identifying each picture in the video data by using a target image emotion identification model to obtain an emotion identification result of each picture.
The target image emotion recognition model in this embodiment may be the target image emotion recognition model obtained by training in the second embodiment.
step 405, determining a second emotion recognition result according to the emotion recognition result of each image in the video data.
Illustratively, the probabilities predicted for the images in the video data are averaged to obtain the second emotion recognition result of the video data.
In the following, the case of three emotion categories is described as an example, and the second emotion recognition result is expressed as:
S_video = {s1_video, s2_video, s3_video};
where S_video denotes the second emotion recognition result corresponding to the video data, s1_video denotes the probability that the video data is recognized as the first emotion, s2_video denotes the probability that the video data is recognized as the second emotion, and s3_video denotes the probability that the video data is recognized as the third emotion.
For example, the video data may include n images, and the recognition result for each image is expressed as S_face1 = {s1_face1, s2_face1, s3_face1}, S_face2 = {s1_face2, s2_face2, s3_face2}, …, S_facen = {s1_facen, s2_facen, s3_facen}.
The second emotion recognition result can then be expressed as:
S_video = {s1_face1, s2_face1, s3_face1}/n + {s1_face2, s2_face2, s3_face2}/n + … + {s1_facen, s2_facen, s3_facen}/n.
And 406, determining an emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result.
In this embodiment, decision fusion is performed on the first emotion recognition result and the second emotion recognition result of the two modalities, voice and video, by weighted summation, and the emotion category is predicted from the fused emotion probability distribution to obtain the final probabilities of the three types of emotion.
The first emotion recognition result is a first probability matrix formed by probability values corresponding to all emotion classifications, and the second emotion recognition result is a second probability matrix formed by probability values corresponding to all emotion classifications; as shown in fig. 6, step 406 may include the following steps.
Step 4061, performing weighted summation on the first probability matrix and the second probability matrix to determine the emotion probability matrix of the target user.
The first probability matrix is weighted with a first weight, and the second probability matrix is weighted with a second weight. The weighted first probability matrix and the weighted second probability matrix are then summed to obtain the emotion probability matrix of the target user.
Alternatively, the first weight may be equal to the second weight.
Step 4062, determining the current emotion type of the target user according to the emotion probability matrix.
Illustratively, this can be expressed by the following formula:
S = α * S_video + β * S_audio
where S denotes the emotion probability distribution of the target user in the target time period, S_audio denotes the first emotion recognition result, S_video denotes the second emotion recognition result corresponding to the video data, and α and β denote the first weight and the second weight. Optionally, α + β = 1. In one example, α = β = 0.5; of course, α and β may take other values, for example α = 0.4 and β = 0.6.
Further, given the determined S = {s1, s2, s3}, the emotion category corresponding to the maximum value among s1, s2 and s3 is taken as the emotion recognition result of the target user.
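The weighted-summation fusion and the final argmax decision of step 406 can be sketched as follows; the class names and probability values used here are illustrative assumptions only.

```python
# Illustrative sketch of step 406: decision-level fusion by weighted
# summation, S = alpha * S_video + beta * S_audio, followed by an argmax
# over the fused emotion probability distribution.
import numpy as np

def fuse_and_classify(s_audio, s_video, alpha=0.5, beta=0.5,
                      classes=("emotion_1", "emotion_2", "emotion_3")):
    s = alpha * np.asarray(s_video, dtype=np.float64) \
        + beta * np.asarray(s_audio, dtype=np.float64)
    return classes[int(np.argmax(s))], s

# Example with alpha = beta = 0.5:
label, fused = fuse_and_classify(s_audio=[0.2, 0.5, 0.3],
                                 s_video=[0.7, 0.2, 0.1])
print(label, fused)   # emotion_1, fused distribution [0.45, 0.35, 0.2]
```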
Table 2 below compares the accuracy of emotion recognition using the speech emotion recognition model alone, using the image emotion recognition model alone, and using the bimodal emotion recognition model provided in this embodiment.
TABLE 2
Here, WA represents the first accuracy and UA represents the second accuracy. WA equals the number of correctly classified samples divided by the total number of samples; UA is obtained by dividing, for each class, the number of correctly classified samples of that class by the total number of samples of that class, and then averaging the results over the classes.
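The two accuracy measures can be computed as in the sketch below; interpreting UA as the average of the per-class accuracies is the usual "unweighted accuracy" reading and is an assumption about the description above.

```python
# Illustrative sketch of the WA/UA metrics. WA = overall fraction of
# correctly classified samples; UA = mean of the per-class accuracies.
import numpy as np

def weighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def unweighted_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# Example: class 0 has 4 samples, class 1 has 2 samples.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(weighted_accuracy(y_true, y_pred))    # 4/6 ≈ 0.667
print(unweighted_accuracy(y_true, y_pred))  # (3/4 + 1/2) / 2 = 0.625
```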
As can be seen from the table, the accuracy achieved by the bimodal emotion recognition model is higher than that achieved by either the speech emotion recognition model or the image emotion recognition model alone.
Example five
Based on the same inventive concept, an emotion recognition apparatus corresponding to the emotion recognition method is further provided in the embodiments of the present application. Since the principle by which the apparatus solves the problem is similar to that of the emotion recognition method described above, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
Please refer to fig. 7, which is a schematic diagram of the functional modules of an emotion recognition apparatus according to an embodiment of the present application. Each module in the emotion recognition apparatus in this embodiment is configured to perform each step in the above method embodiment. The emotion recognition apparatus includes: a first obtaining module 501, a second obtaining module 502, a first recognition module 503, a second recognition module 504, a first determining module 505 and a second determining module 506, wherein:
A first obtaining module 501, configured to obtain voice data generated by a target user in a target time period;
A second obtaining module 502, configured to obtain video data of the target user in the target time period;
A first recognition module 503, configured to use the speech emotion recognition model to recognize the speech data, so as to obtain a first emotion recognition result;
A second recognition module 504, configured to recognize each image in the video data by using the target image emotion recognition model, so as to obtain an image emotion recognition result for each image;
A first determining module 505, configured to determine a second emotion recognition result according to the image emotion recognition result of each image;
A second determining module 506, configured to determine an emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result.
In one possible implementation manner, the first emotion recognition result is a first probability matrix formed by probability values corresponding to the respective emotion classifications, and the second emotion recognition result is a second probability matrix formed by probability values corresponding to the respective emotion classifications; the second determining module 506 includes: a calculation unit and a determination unit;
The calculation unit is configured to perform weighted summation on the first probability matrix and the second probability matrix to determine an emotion probability matrix of the target user;
The determination unit is configured to determine the current emotion type of the target user according to the emotion probability matrix.
In a possible implementation, the calculation unit is further configured to:
Weighting the first probability matrix with a first weight and the second probability matrix with a second weight;
and summing the weighted first probability matrix and the weighted second probability matrix to obtain the emotion probability matrix of the target user, wherein the first weight is equal to the second weight.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the emotion recognition model training method or the emotion recognition method described in the above method embodiments.
The emotion recognition model training method and the computer program product of the emotion recognition method provided in the embodiments of the present application include a computer readable storage medium storing a program code, where instructions included in the program code may be used to execute the emotion recognition model training method or the emotion recognition method described in the above method embodiments, which may be specifically referred to in the above method embodiments and are not described herein again.
in the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
the above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A training method of a bimodal emotion recognition model is characterized by comprising the following steps:
Inputting voice training data into a first neural network model for training to obtain a voice emotion recognition model;
inputting image training data into a second neural network model, and performing first-stage supervised training by adopting a first loss function to obtain a first-stage initial image emotion recognition model;
inputting the image training data into the initial image emotion recognition model of the first stage, and performing supervised training of a second stage by adopting a second loss function to obtain a target image emotion recognition model;
and performing decision-making level fusion on the voice emotion recognition model and the target image emotion recognition model to obtain a bimodal emotion recognition model.
2. The method of claim 1, wherein the step of inputting the image training data into a second neural network model, and performing a first stage of supervised training using a first loss function to obtain a first stage of initial image emotion recognition model comprises:
inputting the image training data into the second neural network model, and performing first-stage supervised training by adopting a cross entropy loss function to obtain a first-stage initial image emotion recognition model;
The step of inputting the image training data into the initial image emotion recognition model of the first stage and adopting a second loss function to perform supervised training of the second stage to obtain a target image emotion recognition model comprises the following steps:
And inputting the image training data into the initial image emotion recognition model in the first stage, and performing supervised training in the second stage by adopting a focus loss function to obtain a target image emotion recognition model.
3. The method of claim 1, wherein the step of inputting speech training data into the first neural network model for training to obtain the speech emotion recognition model comprises:
And inputting the voice training data into a first neural network model, and performing supervised training by adopting a combined loss function consisting of an affinity loss function and a focus loss function to obtain a voice emotion recognition model.
4. The method of claim 1, further comprising:
Recording voice in a target environment by using an acoustic vector sensor, and coding the acquired voice signal by pulse code modulation with a specified number of bits to obtain an initial voice data set;
pre-processing the initial speech data set, the pre-processing comprising: selecting one or more of complete sentence voice data in the initial voice data set, removing noise of the voice data in the initial voice data set, and removing silence data in the voice data in the initial voice data set;
naming the preprocessed initial voice data according to a first set naming rule to obtain a voice training data set, wherein the voice training data is data in the voice training data set;
recording the video in the target environment to obtain initial video data;
And correspondingly cutting the initial video data and the voice data in the voice training data set to obtain a video training data set, wherein the image training data is one or more frames of images in the video data in the video training data set, and the training data set comprises the voice training data set and the video training data set.
5. A bimodal emotion recognition method, comprising:
Acquiring voice data generated by a target user in a target time period;
Acquiring video data of the target user in the target time period;
recognizing the voice data by using the voice emotion recognition model of any one of claims 1-4 to obtain a first emotion recognition result;
Identifying each picture in the video data by using the target image emotion identification model of any one of claims 1 to 4 to obtain an image emotion identification result of each picture;
Determining a second emotion recognition result according to the image emotion recognition result of each image;
And determining the emotion recognition result of the target user according to the decision level fusion of the first emotion recognition result and the second emotion recognition result.
6. The method of claim 5, wherein the first emotion recognition result is a first probability matrix formed by probability values corresponding to each emotion classification, and the second emotion recognition result is a second probability matrix formed by probability values corresponding to each emotion classification; the step of determining the emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result comprises the following steps:
Carrying out weighted summation on the first probability matrix and the second probability matrix to determine an emotion probability matrix of the target user;
and determining the current emotion type of the target user according to the emotion probability matrix.
7. The method of claim 6, wherein the step of determining the emotion probability matrix for the target user by weighted summation of the first probability matrix and the second probability matrix comprises:
weighting the first probability matrix with a first weight and the second probability matrix with a second weight;
And summing the weighted first probability matrix and the weighted second probability matrix to obtain the emotion probability matrix of the target user, wherein the first weight is equal to the second weight.
8. A bimodal emotion recognition model training device is characterized by comprising:
the first training module is used for inputting voice training data into the first neural network model for training so as to obtain a voice emotion recognition model;
The second training module is used for inputting the image training data into a second neural network model and performing supervised training of the first stage by adopting a first loss function so as to obtain an initial image emotion recognition model of the first stage;
And the third training module is used for inputting the image training data into the initial image emotion recognition model in the first stage, adopting a second loss function to perform supervised training in the second stage so as to obtain a target image emotion recognition model, and performing decision-level fusion on the voice emotion recognition model and the target image emotion recognition model so as to obtain a bimodal emotion recognition model.
9. A bimodal emotion recognition apparatus, comprising:
the first acquisition module is used for acquiring voice data generated by a target user in a target time period;
the second acquisition module is used for acquiring the video data of the target user in the target time period;
a first recognition module, configured to use the speech emotion recognition model according to any one of claims 1-4 to recognize the speech data, so as to obtain a first emotion recognition result;
A second identification module, configured to identify each picture in the video data by using the target image emotion identification model according to any one of claims 1 to 4, to obtain an image emotion identification result of each picture;
The first determining module is used for determining a second emotion recognition result according to the image emotion recognition result of each image;
and the second determining module is used for determining the emotion recognition result of the target user according to the first emotion recognition result and the second emotion recognition result.
10. An electronic device, comprising: a processor and a memory storing machine-readable instructions executable by the processor, wherein the machine-readable instructions, when executed by the processor while the electronic device is running, perform the steps of the method of any one of claims 1 to 7.
CN201910851155.9A 2019-09-09 2019-09-09 Bimodal emotion recognition model training method and bimodal emotion recognition method Active CN110556129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910851155.9A CN110556129B (en) 2019-09-09 2019-09-09 Bimodal emotion recognition model training method and bimodal emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910851155.9A CN110556129B (en) 2019-09-09 2019-09-09 Bimodal emotion recognition model training method and bimodal emotion recognition method

Publications (2)

Publication Number Publication Date
CN110556129A true CN110556129A (en) 2019-12-10
CN110556129B CN110556129B (en) 2022-04-19

Family

ID=68739632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910851155.9A Active CN110556129B (en) 2019-09-09 2019-09-09 Bimodal emotion recognition model training method and bimodal emotion recognition method

Country Status (1)

Country Link
CN (1) CN110556129B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111046854A (en) * 2020-01-10 2020-04-21 北京服装学院 Brain wave external identification method, device and system
CN111081280A (en) * 2019-12-30 2020-04-28 苏州思必驰信息科技有限公司 Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111312286A (en) * 2020-02-12 2020-06-19 深圳壹账通智能科技有限公司 Age identification method, age identification device, age identification equipment and computer readable storage medium
CN111429878A (en) * 2020-03-11 2020-07-17 云知声智能科技股份有限公司 Self-adaptive speech synthesis method and device
CN111553277A (en) * 2020-04-28 2020-08-18 电子科技大学 Chinese signature identification method and terminal introducing consistency constraint
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device
CN111723762A (en) * 2020-06-28 2020-09-29 湖南国科微电子股份有限公司 Face attribute recognition method and device, electronic equipment and storage medium
CN112464830A (en) * 2020-12-01 2021-03-09 恒大新能源汽车投资控股集团有限公司 Driver distraction detection method and device
CN112733588A (en) * 2020-08-13 2021-04-30 精英数智科技股份有限公司 Machine running state detection method and device and electronic equipment
CN112949708A (en) * 2021-02-26 2021-06-11 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN113076847A (en) * 2021-03-29 2021-07-06 济南大学 Multi-mode emotion recognition method and system
WO2021134177A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Sentiment labeling method, apparatus and device for speaking content, and storage medium
CN113257282A (en) * 2021-07-15 2021-08-13 成都时识科技有限公司 Speech emotion recognition method and device, electronic equipment and storage medium
CN113327631A (en) * 2021-07-15 2021-08-31 广州虎牙科技有限公司 Emotion recognition model training method, emotion recognition method and emotion recognition device
CN113343860A (en) * 2021-06-10 2021-09-03 南京工业大学 Bimodal fusion emotion recognition method based on video image and voice
CN113408649A (en) * 2021-07-09 2021-09-17 南京工业大学 Multi-mode child emotion recognition fusion model based on video image facial expressions and voice
CN114155478A (en) * 2022-02-09 2022-03-08 苏州浪潮智能科技有限公司 Emotion recognition method, device and system and computer readable storage medium
WO2022116771A1 (en) * 2020-12-02 2022-06-09 Zhejiang Dahua Technology Co., Ltd. Method for analyzing emotion shown in image and related devices
CN114863517A (en) * 2022-04-22 2022-08-05 支付宝(杭州)信息技术有限公司 Risk control method, device and equipment in face recognition
CN116631583A (en) * 2023-05-30 2023-08-22 华脑科学研究(珠海横琴)有限公司 Psychological dispersion method, device and server based on big data of Internet of things
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190012599A1 (en) * 2010-06-07 2019-01-10 Affectiva, Inc. Multimodal machine learning for emotion metrics
US20180143635A1 (en) * 2010-06-07 2018-05-24 Affectiva, Inc. Vehicle manipulation using occupant image analysis
CN104200804A (en) * 2014-09-19 2014-12-10 合肥工业大学 Various-information coupling emotion recognition method for human-computer interaction
US9269374B1 (en) * 2014-10-27 2016-02-23 Mattersight Corporation Predictive video analytics system and methods
US20160183786A1 (en) * 2014-12-30 2016-06-30 Optovue, Inc. Methods and apparatus for retina blood vessel assessment with oct angiography
KR20190060630A (en) * 2017-11-24 2019-06-03 주식회사 제네시스랩 Device, method and readable media for multimodal recognizing emotion based on artificial intelligence
CN108108351A (en) * 2017-12-05 2018-06-01 华南理工大学 A kind of text sentiment classification method based on deep learning built-up pattern
EP3509011A1 (en) * 2018-01-08 2019-07-10 Samsung Electronics Co., Ltd. Apparatuses and methods for recognizing object and facial expression robust against change in facial expression, and apparatuses and methods for training
CN108899050A (en) * 2018-06-14 2018-11-27 南京云思创智信息科技有限公司 Speech signal analysis subsystem based on multi-modal Emotion identification system
CN108899051A (en) * 2018-06-26 2018-11-27 北京大学深圳研究生院 A kind of speech emotion recognition model and recognition methods based on union feature expression
CN110033029A (en) * 2019-03-22 2019-07-19 五邑大学 A kind of emotion identification method and device based on multi-modal emotion model
CN110188730A (en) * 2019-06-06 2019-08-30 山东大学 Face datection and alignment schemes based on MTCNN
GB201910561D0 (en) * 2019-07-24 2019-09-04 Calipsa Ltd Method and system for reviewing and analysing video alarms

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHENGQIN YE,ET AL.: "Multi-Depth Fusion Network for Whole-Heart CT Image Segmentation", 《IEEE ACCESS》 *
PENG YUN,ET AL.: "Focal Loss in 3D Object Detection", 《IEEE ROBOTICS AND AUTOMATION LETTERS 》 *
TSUNG-YI LIN,ET AL.: "Focal Loss for Dense Object Detection", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
HE YUANSHENG: "Research and Implementation of a Chinese Short Text Sentiment Classification Algorithm Based on Deep Learning Multi-Model Fusion", 《CHINA MASTER'S THESES FULL-TEXT DATABASE (INFORMATION SCIENCE AND TECHNOLOGY)》 *
YANG WEI,ET AL.: "Application of an Improved Focal Loss to Semantic Segmentation", 《SEMICONDUCTOR OPTOELECTRONICS》 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134177A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Sentiment labeling method, apparatus and device for speaking content, and storage medium
CN111081280A (en) * 2019-12-30 2020-04-28 苏州思必驰信息科技有限公司 Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN111046854A (en) * 2020-01-10 2020-04-21 北京服装学院 Brain wave external identification method, device and system
CN111046854B (en) * 2020-01-10 2024-01-26 北京服装学院 Brain wave external identification method, device and system
CN111312286A (en) * 2020-02-12 2020-06-19 深圳壹账通智能科技有限公司 Age identification method, age identification device, age identification equipment and computer readable storage medium
CN111429878A (en) * 2020-03-11 2020-07-17 云知声智能科技股份有限公司 Self-adaptive speech synthesis method and device
CN111599338A (en) * 2020-04-09 2020-08-28 云知声智能科技股份有限公司 Stable and controllable end-to-end speech synthesis method and device
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device
CN111553277A (en) * 2020-04-28 2020-08-18 电子科技大学 Chinese signature identification method and terminal introducing consistency constraint
CN111553277B (en) * 2020-04-28 2022-04-26 电子科技大学 Chinese signature identification method and terminal introducing consistency constraint
CN111723762A (en) * 2020-06-28 2020-09-29 湖南国科微电子股份有限公司 Face attribute recognition method and device, electronic equipment and storage medium
CN112733588A (en) * 2020-08-13 2021-04-30 精英数智科技股份有限公司 Machine running state detection method and device and electronic equipment
CN112464830A (en) * 2020-12-01 2021-03-09 恒大新能源汽车投资控股集团有限公司 Driver distraction detection method and device
WO2022116771A1 (en) * 2020-12-02 2022-06-09 Zhejiang Dahua Technology Co., Ltd. Method for analyzing emotion shown in image and related devices
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video
CN112949708A (en) * 2021-02-26 2021-06-11 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112949708B (en) * 2021-02-26 2023-10-24 平安科技(深圳)有限公司 Emotion recognition method, emotion recognition device, computer equipment and storage medium
CN113076847B (en) * 2021-03-29 2022-06-17 济南大学 Multi-mode emotion recognition method and system
CN113076847A (en) * 2021-03-29 2021-07-06 济南大学 Multi-mode emotion recognition method and system
CN113343860A (en) * 2021-06-10 2021-09-03 南京工业大学 Bimodal fusion emotion recognition method based on video image and voice
CN113408649A (en) * 2021-07-09 2021-09-17 南京工业大学 Multi-mode child emotion recognition fusion model based on video image facial expressions and voice
CN113257282B (en) * 2021-07-15 2021-10-08 成都时识科技有限公司 Speech emotion recognition method and device, electronic equipment and storage medium
CN113327631B (en) * 2021-07-15 2023-03-21 广州虎牙科技有限公司 Emotion recognition model training method, emotion recognition method and emotion recognition device
CN113327631A (en) * 2021-07-15 2021-08-31 广州虎牙科技有限公司 Emotion recognition model training method, emotion recognition method and emotion recognition device
CN113257282A (en) * 2021-07-15 2021-08-13 成都时识科技有限公司 Speech emotion recognition method and device, electronic equipment and storage medium
CN114155478A (en) * 2022-02-09 2022-03-08 苏州浪潮智能科技有限公司 Emotion recognition method, device and system and computer readable storage medium
CN114863517A (en) * 2022-04-22 2022-08-05 支付宝(杭州)信息技术有限公司 Risk control method, device and equipment in face recognition
CN114863517B (en) * 2022-04-22 2024-06-07 支付宝(杭州)信息技术有限公司 Risk control method, device and equipment in face recognition
CN116631583A (en) * 2023-05-30 2023-08-22 华脑科学研究(珠海横琴)有限公司 Psychological dispersion method, device and server based on big data of Internet of things

Also Published As

Publication number Publication date
CN110556129B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN110556129B (en) Bimodal emotion recognition model training method and bimodal emotion recognition method
Cummins et al. Speech analysis for health: Current state-of-the-art and the increasing impact of deep learning
Narayanan et al. Behavioral signal processing: Deriving human behavioral informatics from speech and language
Tahon et al. Towards a small set of robust acoustic features for emotion recognition: challenges
CN111461176B (en) Multi-mode fusion method, device, medium and equipment based on normalized mutual information
Krajewski et al. Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
Yap Speech production under cognitive load: Effects and classification
Qin et al. Automatic assessment of speech impairment in cantonese-speaking people with aphasia
Levitan et al. Combining Acoustic-Prosodic, Lexical, and Phonotactic Features for Automatic Deception Detection.
Świetlicka et al. Hierarchical ANN system for stuttering identification
Gupta et al. Analysis of engagement behavior in children during dyadic interactions using prosodic cues
Schmitt et al. Towards cross-lingual automatic diagnosis of autism spectrum condition in children's voices
Turan et al. Monitoring Infant's Emotional Cry in Domestic Environments Using the Capsule Network Architecture.
Deb et al. Classification of speech under stress using harmonic peak to energy ratio
Akinpelu et al. Lightweight deep learning framework for speech emotion recognition
Deepa et al. Speech technology in healthcare
Jo et al. Diagnosis of depression based on four-stream model of bi-LSTM and CNN from audio and text information
Alonso et al. Continuous tracking of the emotion temperature
Najeeb et al. Gamified smart mirror to leverage autistic education-aliza
Lau et al. Improving depression assessment with multi-task learning from speech and text information
Gupta et al. REDE-Detecting human emotions using CNN and RASA
Plummer et al. Computing low-dimensional representations of speech from socio-auditory structures for phonetic analyses
Ding et al. Automatic recognition of student emotions based on deep neural network and its application in depression detection
Agrima et al. Emotion recognition from syllabic units using k-nearest-neighbor classification and energy distribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant