CN111681645B - Emotion recognition model training method, emotion recognition device and electronic equipment - Google Patents

Emotion recognition model training method, emotion recognition device and electronic equipment

Info

Publication number
CN111681645B
CN111681645B (application CN201910141010.XA)
Authority
CN
China
Prior art keywords
feature
data
training
text
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910141010.XA
Other languages
Chinese (zh)
Other versions
CN111681645A (en)
Inventor
何亚豪
蒋栋蔚
韩堃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN201910141010.XA
Publication of CN111681645A
Application granted
Publication of CN111681645B
Legal status: Active

Classifications

    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, specially adapted for estimating an emotional state
    • G10L15/02 Feature extraction for speech recognition; selection of recognition unit
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The application provides an emotion recognition model training method, an emotion recognition method and device, and an electronic device. The method includes: acquiring user data, where the user data includes voice data; converting each piece of voice data into text data; fusing each piece of user data and the text data converted from the voice data it contains into one training feature, the training features obtained for all user data forming the training data; and inputting the training data into an initial network model for training to obtain an emotion recognition model. Because the training data combine the speech with the text converted from that speech, the trained emotion recognition model is more adaptable and achieves a better recognition effect.

Description

Emotion recognition model training method, emotion recognition device and electronic equipment
Technical Field
The application relates to the technical field of data processing, in particular to a method and a device for training an emotion recognition model and electronic equipment.
Background
The popularity of online booking (network contract) services has brought convenience to daily life, but it is accompanied by uncertainty about the situation of the users involved. In the prior art, users are authenticated by verifying the identity information of the parties so as to learn about them; identity information, however, does not reflect a user's current state.
Disclosure of Invention
In view of this, embodiments of the present application aim to provide an emotion recognition model training method, an emotion recognition method and device, and an electronic device that form training data from two types of features, speech and the text converted from that speech, and train an emotion recognition model on them, so that the model is more adaptable and the trained model achieves a better recognition effect.
According to one aspect of the present application, an electronic device is provided that may include one or more storage media and one or more processors in communication with the storage media. One or more storage media store machine-readable instructions that are executable by a processor. When the electronic device is operated, the processor communicates with the storage medium through the bus, and the processor executes the machine readable instructions to perform one or more of the following operations:
acquiring user data, wherein the user data comprises voice data;
converting each piece of voice data into text data;
fusing each piece of user data and text data converted from voice data contained in the user data into a piece of training feature, and forming training data by the obtained training features corresponding to all user data;
and inputting the training data into an initial network model for training to obtain an emotion recognition model.
According to the emotion recognition model training method, the emotion recognition model is trained on training data formed from the speech and the text converted from that speech. Compared with training a model on a single data type as in the prior art, the emotion recognition model is more adaptable and the trained model achieves a better recognition effect. In addition, the trained user emotion recognition model can recognize the corresponding user state.
In some embodiments, the step of fusing each piece of the user data and text data converted from speech data included in the user data into a piece of training feature, and forming training data from training features corresponding to all obtained user data includes:
performing feature extraction on each piece of text data to obtain text features;
extracting the characteristics of each piece of voice data to obtain voice characteristics;
and fusing any one voice feature and the text feature corresponding to any one voice feature to obtain a training feature, and forming training data by the obtained training features corresponding to all user data.
In some embodiments, the step of extracting features of each piece of text data to obtain text features includes:
mapping the text data to a hyperplane space to form a text point in the hyperplane space;
coding the text points on a first convolution network to obtain a first intermediate characteristic;
and extracting important features from the first intermediate features through maximum value pooling to obtain text features.
In some embodiments, the step of encoding the text point on the first convolutional network to obtain a first intermediate feature includes:
processing the text points in a first convolution network to obtain a first original characteristic;
processing the first original feature through an attention mechanism to obtain a first attention feature;
and carrying out weighting processing on the first original feature and the first attention feature to obtain a first intermediate feature.
In some embodiments, the step of extracting features of each piece of speech data to obtain speech features includes:
processing the voice data through a second convolution network to obtain a second intermediate characteristic;
inputting the second intermediate features into a long-short term memory model network, and identifying the context dependency of the second intermediate features;
and extracting the important features from the second intermediate features through maximum value pooling to obtain the voice features.
In some embodiments, the step of processing the voice data through a second convolutional network to obtain a second intermediate feature includes:
processing the voice data in a second convolution network to obtain a second original characteristic;
processing the second original characteristic through an attention mechanism to obtain a second attention characteristic;
and performing weighting processing on the second original feature and the second attention feature to obtain a second intermediate feature.
In some embodiments, the step of fusing any one of the speech features and the text feature corresponding to the any one of the speech features to obtain a training feature includes:
and splicing any one voice feature and the text feature corresponding to any one voice feature to form a training feature.
In some embodiments, the step of splicing any one of the speech features with the text feature corresponding to the any one of the speech features to form a training feature includes:
supplementing a set value of a set quantity with each text feature and each voice feature to obtain a supplemented text feature and a supplemented voice feature;
and performing outer product on any supplementary text feature and the voice feature corresponding to any supplementary text feature to obtain a training feature.
In some embodiments, the user data further includes image data, and the step of fusing each piece of the user data and text data converted from voice data included in the user data into one training feature, where the obtained training features corresponding to all pieces of the user data form training data, includes:
and fusing any image data, voice data corresponding to any image data and text data converted from the voice data to form a training feature, and forming training data by the obtained training features corresponding to all user data.
According to the method in this embodiment, image data is added to the training data, which adds a further type of model training data; the relations between the various types of data can then be learned, improving the adaptability of the trained model, so that the trained emotion recognition model has a higher recognition success rate.
In some embodiments, the step of fusing each piece of the user data and text data converted from speech data included in the user data into a piece of training feature, and forming training data from training features corresponding to all obtained user data includes:
carrying out feature extraction on the image data to obtain image features;
and fusing any text feature with the voice feature and the image feature corresponding to the text feature to obtain a training feature, and forming training data by the obtained training features corresponding to all user data.
In some embodiments, the step of extracting the features of the image data to obtain image features includes:
processing the image data through a third convolution network to obtain a third intermediate characteristic;
inputting the third intermediate features into a long-short term memory model network, identifying context dependencies in the third intermediate features;
and extracting the important features from the third intermediate features through maximum value pooling to obtain the image features.
In some embodiments, the step of processing the image data through a third convolutional network to obtain a third intermediate feature includes:
processing the image data in a third convolution network to obtain a third original characteristic;
processing the third original characteristic through an attention mechanism to obtain a third attention characteristic;
and performing weighting processing on the third original feature and the third attention feature to obtain a third intermediate feature.
In some embodiments, the step of inputting the training data into an initial network model for training to obtain an emotion recognition model includes:
inputting the training data into a latest training model for calculation to obtain an initial calculation result;
calculating the initial calculation result and a labeling result corresponding to the training data to obtain a current error of the current model;
if the current error is larger than a set value, adjusting parameters in the initial network model in a set calculation mode, and updating the training model;
and if the current error is smaller than the set value, taking the training model corresponding to the current error smaller than the set value as the emotion recognition model.
In some embodiments, the step of calculating the initial calculation result and the labeling result corresponding to the training data to obtain the current error of the current model includes: calculating the initial calculation result and the labeling result corresponding to the training data in a manner that applies a penalty term, and calculating the current error of the current model; or,
the step of inputting the training data into the latest training model for calculation to obtain an initial calculation result comprises the following steps: and inputting the training data into the latest training model for calculation in a mode of applying a penalty item to obtain an initial calculation result.
According to the method in the embodiment, the generalization capability of the model can be improved by applying the penalty term in the training process.
In another aspect, an embodiment of the present application further provides an emotion recognition method, including:
acquiring current user data of a target user;
and inputting the current user data into the emotion recognition model for recognition to obtain the current state of the target user.
In some embodiments, the method further comprises:
and if the current state represents that the target user is in an unsafe state, generating a prompt message, and sending the prompt message to a target user terminal or an associated platform.
In another aspect, an embodiment of the present application further provides an emotion recognition model training apparatus, including:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring user data which comprises voice data;
the conversion module is used for converting each piece of voice data into text data;
the fusion module is used for fusing each piece of user data and text data converted from voice data contained in the user data into a piece of training feature, and the obtained training features corresponding to all the user data form training data;
and the training module is used for inputting the training data into an initial network model for training to obtain an emotion recognition model.
In some embodiments, the fusion module is further configured to:
performing feature extraction on each piece of text data to obtain text features;
performing feature extraction on each piece of voice data to obtain voice features;
and fusing any one voice feature with the text feature corresponding to the any one voice feature to obtain a training feature, and forming training data by the training features corresponding to all the obtained user data.
In some embodiments, the fusion module is further configured to:
mapping the text data to a hyperplane space to form a text point in the hyperplane space;
coding the text points on a first convolution network to obtain a first intermediate characteristic;
and extracting important features from the first intermediate features through maximum value pooling to obtain text features.
In some embodiments, the fusion module is further configured to:
processing the text points in a first convolution network to obtain a first original characteristic;
processing the first original feature through an attention mechanism to obtain a first attention feature;
and carrying out weighting processing on the first original feature and the first attention feature to obtain a first intermediate feature.
In some embodiments, the fusion module is further configured to:
processing the voice data through a second convolution network to obtain a second intermediate characteristic;
inputting the second intermediate features into a long-short term memory model network, and identifying the context dependency of the second intermediate features;
and extracting the important features from the second intermediate features through maximum value pooling to obtain the voice features.
In some embodiments, the fusion module is further configured to:
processing the voice data in a second convolution network to obtain a second original characteristic;
processing the second original characteristic through an attention mechanism to obtain a second attention characteristic;
and performing weighting processing on the second original feature and the second attention feature to obtain a second intermediate feature.
In some embodiments, the fusion module is further configured to:
and splicing any one voice feature and the text feature corresponding to any one voice feature to form a training feature.
In some embodiments, the fusion module is further configured to:
supplementing a set value of a set quantity with each text feature and each voice feature to obtain a supplemented text feature and a supplemented voice feature;
and performing outer product on any supplementary text feature and the voice feature corresponding to any supplementary text feature to obtain a training feature.
In some embodiments, the user data further comprises image data, the fusion module further to:
and fusing any image data, voice data corresponding to any image data and text data converted from the voice data to form a training feature, and forming training data by using the training features corresponding to all the obtained user data.
In some embodiments, the fusion module is further configured to:
carrying out feature extraction on the image data to obtain image features;
and fusing any text feature with the voice feature and the image feature corresponding to the text feature to obtain a training feature, and forming training data by the training features corresponding to all the obtained user data.
In some embodiments, the fusion module is further configured to:
processing the image data through a third convolution network to obtain a third intermediate characteristic;
inputting the third intermediate features into a long-short term memory model network, and identifying the dependency relationship of the context in the third intermediate features;
and extracting the important features from the third intermediate features through maximum value pooling to obtain the image features.
In some embodiments, the fusion module is further configured to:
processing the image data in a third convolution network to obtain a third original characteristic;
processing the third original characteristic through an attention mechanism to obtain a third attention characteristic;
and performing weighting processing on the third original feature and the third attention feature to obtain a third intermediate feature.
In some embodiments, the training module is further configured to:
inputting the training data into a latest training model for calculation to obtain an initial calculation result;
calculating the initial calculation result and a labeling result corresponding to the training data to obtain a current error of the current model;
if the current error is larger than a set value, adjusting parameters in the initial network model in a set calculation mode, and updating the training model;
and if the current error is smaller than the set value, taking the training model corresponding to the current error smaller than the set value as the emotion recognition model.
In some embodiments, the training module is further configured to:
calculating the initial calculation result and the labeling result corresponding to the training data in a manner that applies a penalty term, and calculating the current error of the current model; or,
and inputting the training data into the latest training model for calculation in a mode of applying a penalty item to obtain an initial calculation result.
In another aspect, an embodiment of the present application further provides an emotion recognition apparatus, including:
the second acquisition module is used for acquiring the current user data of the target user;
and the identification module is used for inputting the current user data into the emotion identification model for identification to obtain the current state of the target user.
In some embodiments, the apparatus further comprises:
and the prompting module is used for generating a prompting message if the current state represents that the target user is in an unsafe state, and sending the prompting message to a target user terminal or a related platform.
In another aspect, the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the emotion recognition model training method in the above-mentioned embodiments.
In another aspect, the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to perform the steps of the emotion recognition method in the above-mentioned embodiments.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for training an emotion recognition model provided in an embodiment of the present application;
FIG. 3 is a diagram illustrating a training model in an example provided by an embodiment of the present application;
fig. 4 shows a flowchart of an emotion recognition method provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an emotion recognition model training device provided in an embodiment of the present application;
fig. 6 shows a schematic structural diagram of an emotion recognition device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Currently, online booking (network contract) services have become an important part of daily life, but there is some uncertainty about the service party and the served party in such services. Thus, a network contract service may raise safety issues. For example, in a ride-hailing service, the driver and the passenger share a small space, and either of them may exhibit unknown actions or emotions that indicate their current state. In the prior art, however, the states of the service party and the served party during a network contract service are not recognized.
Based on this, the inventors studied the monitoring of both the service party and the served party in a network contract service and proposed that a user's state can be monitored by collecting and recognizing the voice messages generated by the two users. However, a voice message alone is a single modality and may not represent the user's state well. The inventors therefore further proposed converting a voice message into a text message, so that data of multiple modalities can be recognized, thereby improving the accuracy of recognizing the user's state.
The results of the studies of the inventors are described in detail below by way of a number of examples.
To enable those skilled in the art to use the present disclosure, the following embodiments are given in conjunction with a specific application scenario, an online car-booking service. It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the application. Although the present application is described primarily in the context of an online car-booking service, it should be understood that this is only one exemplary embodiment. The application can be applied to any other traffic type. For example, the present application may be applied to different transportation system environments, including land, sea, or air, or any combination thereof. The vehicle of the transportation system may include a taxi, a private car, a carpool vehicle, a bus, a train, a bullet train, a high-speed rail, a subway, a ship, an airplane, a spacecraft, a hot-air balloon, an unmanned vehicle, or the like, or any combination thereof. The present application may also include any service system involving a booking service or a two-party service, for example, a system for sending and/or receiving express deliveries, or a service system for business transactions. Applications of the system or method of the present application may include web pages, browser plug-ins, client terminals, customization systems, internal analysis systems, artificial intelligence robots, or the like, or any combination thereof.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
It is noted that, before this application was filed, only identity authentication of the driver or the passenger was used to learn about the service provider and the service requester. The emotion recognition model training method, emotion recognition method and apparatus provided by the application, in contrast, can recognize a user's emotion or provide a model for recognizing a user's emotion.
Example one
Fig. 1 illustrates a schematic diagram of exemplary hardware and software components of an electronic device 100 according to some embodiments of the present application. For example, a processor of the electronic device 100 may be used to perform the functions described in the present application.
The electronic device 100 may be a general-purpose computer or a special-purpose computer, both of which may be used to implement the emotion recognition model training method or emotion recognition method of the present application. Although only a single computer is shown, for convenience, the functions described herein may be implemented in a distributed fashion across multiple similar platforms to balance processing loads.
For example, the electronic device 100 may include a network port 110 connected to a network, one or more processors 120 for executing program instructions, a communication bus 130, and a storage medium 140 of different form, such as a disk, ROM, or RAM, or any combination thereof. Illustratively, a computer platform may also include program instructions stored in ROM, RAM, or other types of non-transitory storage media, or any combination thereof. The method of the present application may be implemented in accordance with these program instructions. The electronic device 100 also includes an Input/Output (I/O) interface 150 between the computer and other Input/Output devices (e.g., keyboard, display screen).
For ease of illustration, only one processor is depicted in the electronic device 100. However, it should be noted that the electronic device 100 in the present application may also comprise a plurality of processors, and thus the steps described herein as performed by one processor may also be performed by a plurality of processors jointly or individually. For example, if the processor of the electronic device 100 executes steps A and B, it should be understood that steps A and B may also be executed jointly by two different processors or separately within one processor; for example, a first processor performs step A and a second processor performs step B, or the first processor and the second processor perform steps A and B together.
Example two
The embodiment provides a method for training an emotion recognition model. The method in this embodiment may be performed by an electronic device. FIG. 2 shows a flow diagram of a method for emotion recognition model training in an embodiment of the present application. The following describes in detail the flow of the emotion recognition model training method shown in fig. 2.
Step S201, user data is acquired.
The user data may include voice data. The user data may be history data generated in a setting application environment; or may be user data generated according to a set rule.
In one example, the user data may be data generated during a network appointment service. In a network appointment service scenario, the user data may be data generated by the driver, e.g., the voice of the driver speaking; the user data may also be data generated by passengers; the user data may also be dialogue data generated by the driver and the passenger.
In another example, the user data may be data generated during a takeaway service.
Step S202, each piece of voice data is converted into text data.
Step S203, fusing each piece of user data and text data converted from the voice data included in the user data into a piece of training feature, and forming training data from the training features corresponding to all the obtained user data.
In one embodiment, if the user data includes voice data, each piece of voice data and the text data converted from it are fused into one training feature, so each piece of voice data yields one training feature. The number of training features may therefore equal the number of pieces of voice data.
In some embodiments, the step S203 may include:
step S2031, performing feature extraction on each piece of text data to obtain text features.
In some embodiments, step S2031 may include: mapping the text data to a hyperplane space to form a text point in the hyperplane space; coding the text point on a first convolution network to obtain a first intermediate characteristic; and extracting important features from the first intermediate features through maximum value pooling to obtain text features.
Specifically, in one example, refer to fig. 3, which is a schematic diagram of the network model used when performing the emotion recognition model training method; for the implementation of step S2031, see the TEXT processing path. Step S2031 may be implemented as follows: first, the text data is input and mapped into a hyperplane space by the EMBEDDING layer, forming text points in that space; the CONV layer then performs the encoding operation to obtain a first intermediate feature; finally, MAX POOLING extracts the important features to obtain the text features.
Here, MAX POOLING takes the maximum of the feature points within a neighborhood. It can reduce the error caused by the deviation of the estimated mean due to convolution-layer parameter errors, and more texture information is retained.
A hyperplane is a subspace whose dimension is less than that of the ambient space. In an alternative embodiment, the hyperplane space above may be a high-dimensional dense hyperplane space.
Processing the text points in the first convolutional network to obtain the first original feature may include: inputting the text points into two convolution layers for processing, and then inputting the processing result into a time-sequence network to obtain the first intermediate feature.
The time-sequence network may be an LSTM (long short-term memory) network.
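As a concrete illustration of the TEXT path just described, the following is a minimal sketch in PyTorch; the layer sizes, kernel sizes, and the use of two convolution layers followed by an LSTM are assumptions for illustration, not values fixed by the application.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Sketch of the TEXT path: EMBEDDING -> CONV -> (LSTM) -> MAX POOLING."""

    def __init__(self, vocab_size=10000, embed_dim=128, conv_dim=64):
        super().__init__()
        # Map each token into a dense (hyperplane) space.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Two stacked convolution layers encode the text points.
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, conv_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(conv_dim, conv_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Optional time-sequence network (LSTM) over the encoded sequence.
        self.lstm = nn.LSTM(conv_dim, conv_dim, batch_first=True)

    def forward(self, token_ids):                # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)            # (batch, seq_len, embed_dim)
        x = self.conv(x.transpose(1, 2))         # (batch, conv_dim, seq_len)
        x, _ = self.lstm(x.transpose(1, 2))      # (batch, seq_len, conv_dim)
        # Max pooling over time keeps the most salient value per channel.
        return x.max(dim=1).values               # (batch, conv_dim) text feature
```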
Based on the implementation of step S2031, further improvement is possible: an Attention mechanism may be added after the encoding operation, so that a new feature is derived from the original feature through the attention mechanism; by weighting the original feature and this new feature, the weighted result better characterizes the content that the feature needs to express.
The above-mentioned encoding operation of the text point on the first convolutional network to obtain the first intermediate feature may include: processing the text points in a first convolution network to obtain a first original characteristic; processing the first original feature through an attention mechanism to obtain a first attention feature; and carrying out weighting processing on the first original feature and the first attention feature to obtain a first intermediate feature.
The first intermediate feature may be obtained by weighted summation of the first original feature and the first attention feature. Wherein the weight of the first original feature and the weight of the first attention feature may be set to 0.7 and 0.3, respectively; of course, it can be set to 0.8 and 0.2; it may also be set to 0.6 and 0.4. The specific weight can be set according to specific requirements.
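A minimal sketch of this weighted combination follows; the attention score used here is a simple illustrative choice, and the 0.7/0.3 weights are one of the example settings mentioned above.

```python
import torch
import torch.nn.functional as F

def attention_weighted(original, w_original=0.7, w_attention=0.3):
    """Weighted sum of an original feature and its attention-derived feature."""
    # original: (batch, channels, seq_len) produced by the convolution layers.
    # A simple per-time-step attention score (an illustrative assumption).
    scores = F.softmax(original.mean(dim=1, keepdim=True), dim=-1)   # (batch, 1, seq_len)
    attention_feature = original * scores                            # re-weighted copy
    # The intermediate feature is the weighted sum of the two.
    return w_original * original + w_attention * attention_feature
```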
Step S2032, extracting the characteristics of each piece of voice data to obtain voice characteristics.
In some embodiments, step S2032 may include: processing the voice data through a second convolutional network to obtain a second intermediate feature; inputting the second intermediate feature into a long short-term memory network and identifying its context dependencies; and extracting the important features from the second intermediate features through maximum value pooling to obtain the voice features.
In one example, fbank features may be extracted from the speech data.
Based on the implementation of step S2032, further improvement is possible: an Attention mechanism may be added after the encoding operation, so that a new feature is derived from the original feature through the attention mechanism; by weighting the original feature and this new feature, the weighted result better characterizes the content that the feature needs to express.
Specifically, in one example, referring to the AUDIO processing path shown in fig. 3, step S2032 may be implemented as follows: the voice data is input into the second convolutional network, i.e., the CONV layer in fig. 3, to obtain a second intermediate feature; the second intermediate feature may be input into the LSTM layer to identify its context dependencies, and then further input into the MAX POOLING layer for processing, so as to obtain a voice feature capable of expressing the features in a piece of voice data.
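A corresponding sketch of the AUDIO path, operating on fbank features, is given below; the fbank dimension and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """Sketch of the AUDIO path: CONV -> LSTM -> MAX POOLING over fbank frames."""

    def __init__(self, fbank_dim=40, conv_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(fbank_dim, conv_dim, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_dim, conv_dim, batch_first=True)

    def forward(self, fbank):                            # fbank: (batch, frames, fbank_dim)
        x = torch.relu(self.conv(fbank.transpose(1, 2))) # (batch, conv_dim, frames)
        x, _ = self.lstm(x.transpose(1, 2))              # captures context dependencies
        return x.max(dim=1).values                       # (batch, conv_dim) voice feature
```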
In some embodiments, the processing the voice data through the second convolutional network to obtain the second intermediate feature may include: processing the voice data in a second convolution network to obtain a second original characteristic; processing the second original characteristic through an attention mechanism to obtain a second attention characteristic; and performing weighting processing on the second original feature and the second attention feature to obtain a second intermediate feature.
The weighted summation of the second original feature and the second attention feature may result in a second intermediate feature. Wherein the weight of the second original feature and the weight of the second attention feature can be set to 0.7 and 0.3, respectively; of course, it can be set to 0.8 and 0.2; it may also be set to 0.6 and 0.4. The specific weight may be set according to specific requirements.
Step S2033, any one of the voice features and the text feature corresponding to the any one of the voice features are fused to obtain a training feature, and the obtained training features corresponding to all the user data form training data.
Step S2033 may include: and splicing any one voice feature and the text feature corresponding to any one voice feature to form a training feature.
In an embodiment, the above splicing any one of the speech features with the text feature corresponding to the any one of the speech features to form one training feature may include: supplementing a set value of a set quantity with each text feature and each voice feature to obtain a supplemented text feature and a supplemented voice feature; and performing outer product on any supplementary text feature and the voice feature corresponding to the supplementary text feature to obtain the training feature.
The set quantity may be 1, 2, or another value as needed. The set quantity may also be the length difference between the two features: for example, if the voice feature is longer than the text feature by 3, the set quantity corresponding to the text feature may be 3 and the set quantity corresponding to the voice feature may be 0.
The above-mentioned set value may be a binary number of 1.
In one example, a voice feature and the text feature corresponding to it (for example, 10 × 1 vectors) may be combined by outer product: after a "1" is appended to each feature vector, the two 11-dimensional vectors are multiplied to obtain an 11 × 11 matrix.
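The following sketch illustrates this padding and outer-product fusion; the padding value of 1 and the 10-dimensional inputs follow the example above, and the function name is hypothetical.

```python
import torch

def fuse_outer_product(text_feat, speech_feat, pad_value=1.0):
    """Append the set value to each feature vector and take their outer product."""
    t = torch.cat([text_feat, torch.tensor([pad_value])])    # e.g. 10 -> 11 elements
    s = torch.cat([speech_feat, torch.tensor([pad_value])])  # e.g. 10 -> 11 elements
    return torch.outer(t, s)                                 # (11, 11) training feature

fused = fuse_outer_product(torch.randn(10), torch.randn(10))
print(fused.shape)  # torch.Size([11, 11])
```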
In another embodiment, the above splicing any speech feature with the text feature corresponding to the any speech feature to form a training feature may include: any speech feature and the text feature corresponding to the speech feature can be directly combined into a training feature.
The prediction effect of the trained emotion recognition model can be improved by splicing the features.
The user data may further include image data, and step S203 may include: and fusing any image data, voice data corresponding to any image data and text data converted from the voice data to form a training feature, and forming training data by using the training features corresponding to all the obtained user data.
Specifically, the fusion into a training feature may be a fusion of a piece of speech data and text data converted from the speech data, that is, the fusion obtained training feature is also a fusion of related data of the same speech.
By adding image data to the training data, a further type of model training data is introduced; the relations between the various types of data can then be learned, improving the adaptability of the trained model, so that the trained emotion recognition model has a higher recognition success rate.
The user data described above may further include image data, and step S203 may include: performing feature extraction on the image data to obtain image features; and fusing any text feature with the voice feature and the image feature corresponding to the text feature to obtain a training feature, and forming training data by the training features corresponding to all the obtained user data.
The step of extracting the features of the image data to obtain the image features includes: processing the image data through a third convolutional network to obtain a third intermediate feature; inputting the third intermediate feature into a long short-term memory network and identifying the context dependencies in it; and extracting the important features from the third intermediate features through maximum value pooling to obtain the image features.
Specifically, in one example, referring to the IMAGE processing path shown in fig. 3, the feature extraction on the image data may be implemented as follows: the image data is input into the third convolutional network, i.e., the CONV layer in fig. 3, to obtain a third intermediate feature; the third intermediate feature may be input into the LSTM layer to identify its context dependencies, and then further input into the MAX POOLING layer for processing, so as to obtain an image feature capable of expressing the features in a piece of image data.
Based on this implementation of image feature extraction, further improvement is possible: an Attention mechanism may be added after the encoding operation, so that a new feature is derived from the original feature through the attention mechanism; by weighting the original feature and this new feature, the weighted result better characterizes the content that the feature needs to express.
In some embodiments, the step of processing the image data through a third convolutional network to obtain a third intermediate feature includes: processing the image data in a third convolution network to obtain a third original characteristic; processing the third original characteristic through an attention mechanism to obtain a third attention characteristic; and performing weighting processing on the third original feature and the third attention feature to obtain a third intermediate feature.
The weighted summation of the third original feature and the third attention feature may result in a third intermediate feature. Wherein the weight of the third original feature and the weight of the third attention feature can be set to 0.7 and 0.3, respectively; of course, it can be set to 0.8 and 0.2; and may be set to 0.6 and 0.4. The specific weight may be set according to specific requirements.
By adding the attention mechanism, the model of the training process can be more accurate.
After feature extraction is performed on the image data, the voice data, and the text data, the FUSION layer shown in fig. 3 may be used to perform FUSION, so as to obtain training features including features carried by each of the image data, the voice data, and the text data.
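As a minimal sketch of the FUSION step over the three extracted feature vectors, simple concatenation is used below; this is an assumption for illustration, since the exact operation of the FUSION layer for the three-modality case is not fixed in code here.

```python
import torch

def fuse_modalities(image_feat, speech_feat, text_feat):
    """Fuse the image, voice, and text feature vectors into one training feature."""
    # Each argument: (batch, feature_dim); the result carries features of all three modalities.
    return torch.cat([image_feat, speech_feat, text_feat], dim=-1)
```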
And step S204, inputting the training data into an initial network model for training to obtain an emotion recognition model.
The initial network model described above may be a multilayer feedforward neural network.
The step S204 may include: inputting the training data into a latest training model for calculation to obtain an initial calculation result; calculating the initial calculation result and a labeling result corresponding to the training data to obtain a current error of the current model; if the current error is larger than a set value, adjusting parameters in the initial network model in a set calculation mode, and updating the training model; and if the current error is smaller than the set value, taking the training model corresponding to the current error smaller than the set value as the emotion recognition model.
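A minimal sketch of this training loop follows: compute the initial calculation result, measure the current error against the labeling result, and keep adjusting parameters until the error falls below the set value. The optimizer, loss function, and threshold are assumptions, and the model is assumed to output raw logits.

```python
import torch
import torch.nn as nn

def train_emotion_model(model, train_features, labels, set_value=0.05, max_steps=10000):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_steps):
        logits = model(train_features)        # initial calculation result
        error = criterion(logits, labels)     # current error vs. labeling result
        if error.item() < set_value:          # stop once the error is below the set value
            break
        optimizer.zero_grad()
        error.backward()                      # adjust parameters in the set calculation mode
        optimizer.step()
    return model                              # the resulting emotion recognition model
```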
When the initial network model is too large, some risk may be incurred. Based on this, penalty terms can be added in the process of training the model, so as to improve the effect of model training.
In one embodiment, a regularization term may be added to the loss function; alternatively, a penalty function with the penalty weights described above may be set in the initial network model.
In some embodiments, the calculating the initial calculation result and the labeling result corresponding to the training data to obtain the current error of the current model may include: and calculating the initial calculation result and the labeling result corresponding to the training data in a mode of applying a penalty item, and calculating to obtain the current error of the current model.
In one example, the loss function can be written as:

C = C_0 + \frac{\lambda}{n} \sum_{w} |w|

where C_0 represents the empirical risk and the second term is called the structural risk; λ is a constant and n is the number of training samples used by the training model. The empirical risk is the risk brought by the empirical difference produced by the sum of the residuals between the fitting result and the sample labels, i.e., the risk of under-fitting; the structural risk is the risk caused by the model being insufficiently compact.
The structural risk sums the absolute values of all weights w in the model and divides by the number of samples; the coefficient λ is the penalty weight, also called the regularization coefficient or penalty coefficient, and expresses how important the penalty is. If structural risk is to be emphasized, i.e., too much structural risk is undesirable, the whole loss function pushes the weights w towards smaller values: the more numerous and larger the values of w, the larger this term becomes, i.e., the less compact the model. This regularization term on the structural risk in the loss function is called the L1 regularization term.
In another example, the structural risk may instead use an L2 regularization term, and the loss function can then be expressed as:

C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2
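A minimal sketch of these penalized losses is given below, assuming a PyTorch model; c0 stands for the empirical risk already computed (e.g., a cross-entropy value), and the function name is hypothetical.

```python
import torch

def regularized_loss(c0, model, lam, n, kind="l1"):
    """Empirical risk C0 plus an L1 or L2 structural-risk (penalty) term."""
    if kind == "l1":
        penalty = sum(p.abs().sum() for p in model.parameters()) * lam / n
    else:  # L2 regularization term
        penalty = sum((p ** 2).sum() for p in model.parameters()) * lam / (2 * n)
    return c0 + penalty
```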
according to the method in the embodiment, the generalization capability of the model can be improved by applying the penalty term in the training process.
In another embodiment, the inputting the training data into the latest training model for calculation to obtain the initial calculation result may include: and inputting the training data into the latest training model for calculation in a mode of applying a penalty item to obtain an initial calculation result.
Specifically, by applying a penalty term, training of part of the units in the model can be randomly skipped in each training pass; this reduces the amount of computation per pass and can improve the generalization capability of the model.
In one example, referring to fig. 3, during training the fused training features are input into the DENSE layer for calculation, and classification is implemented by the SOFTMAX classification layer to obtain the training output result.
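The following sketch shows one way to realize this DENSE plus SOFTMAX head on the fused 11 × 11 feature; the hidden size, the two output classes, and the dropout layer (one common way to randomly skip part of the network, as described above) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Sketch of the classification head: flatten -> DENSE -> SOFTMAX."""

    def __init__(self, fused_dim=11 * 11, hidden_dim=64, num_classes=2):
        super().__init__()
        self.dense = nn.Linear(fused_dim, hidden_dim)
        self.drop = nn.Dropout(p=0.5)     # randomly skips units during training only
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, fused_feature):                  # fused_feature: (batch, 11, 11)
        x = torch.relu(self.dense(fused_feature.flatten(1)))
        x = self.drop(x)
        return torch.softmax(self.out(x), dim=-1)      # class probabilities via SOFTMAX
```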
According to the emotion recognition model training method provided by the embodiments of the present application, the emotion recognition model is trained on training data formed from two types of features, the speech and the text converted from that speech. Compared with training a model on a single data type as in the prior art, the emotion recognition model is more adaptable and the trained model achieves a better recognition effect. In addition, the trained user emotion recognition model can recognize the corresponding user state.
EXAMPLE III
This embodiment provides an emotion recognition method. The method in this embodiment may be performed by an electronic device. The emotion recognition method in this embodiment may be executed by an electronic device different from the one that executes the emotion recognition model training method in Embodiment two, or by the same electronic device. Fig. 4 shows a flowchart of the emotion recognition method in an embodiment of the present application. The flow of the emotion recognition method shown in fig. 4 is described in detail below.
Step S301, current user data of the target user is acquired.
The method in the embodiment can be applied to the user terminal. The above-described step S301 may be implemented as: acquiring a picture or a video of a user through an image acquisition device of a user terminal; and voice data of a user can be acquired through the voice acquisition device.
In an application scenario, the emotion recognition method in this embodiment may be used in a car booking service, and the step S301 may be implemented as: video data, image data and the like are acquired through an in-vehicle camera connected with a user terminal.
The user terminal can also be provided with a target application program, and the target application program can be provided with a collection module for collecting user data.
The method in this embodiment may also be applied to a server that is in communication connection with a user terminal. The server acquires user data acquired by the user terminal.
Step S302, inputting the current user data into the emotion recognition model for recognition, and obtaining the current state of the target user.
The emotion recognition model described above can output a probability indicating whether the state matching the user data is safe.
Optionally, if the output probability is greater than a set value, the current state of the target user is an unsafe state; and if the output probability value is smaller than the set value, the current state of the target user is obtained as the safe state.
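A minimal sketch of this recognition step is shown below; the model is assumed to output class probabilities with the second entry corresponding to the unsafe state, and the set value of 0.5 is an assumption.

```python
import torch

def recognize_current_state(model, current_user_feature, set_value=0.5):
    """Compare the model's output probability with the set value to decide the state."""
    with torch.no_grad():
        probs = model(current_user_feature)      # assumed to output class probabilities
        prob_unsafe = probs[..., 1].item()       # probability that the matched state is unsafe
    return "unsafe" if prob_unsafe > set_value else "safe"
```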
In some embodiments, the method further comprises: and if the current state represents that the target user is in an unsafe state, generating a prompt message, and sending the prompt message to a target user terminal or a related platform.
The prompt message may include, but is not limited to, the current location, current status, captured current image of the target user, prompt voice, etc.
If the output result is "safe state", it indicates that the target user is in safe state, and other processing may not be performed. If the output result is in an unsafe state, the target user may have some potential safety hazards, and some prompt measures can be taken.
In one embodiment, if the emotion recognition method is used in a user terminal, a prompt voice in a prompt message may be output in the user terminal.
In another embodiment, a target application program may be installed in the user terminal, and the prompt message may be sent to a backend server that provides each service module of the target application program, so as to be further submitted to a terminal of a relevant administrator.
In another implementation, the prompt message may be sent to a police account, so as to alert the police directly.
Reminding of the unsafe state in these various ways can improve the safety of the target user.
Example four
Based on the same application concept, an emotion recognition model training device corresponding to the emotion recognition model training method is further provided in the embodiment of the application, and as the principle of solving the problem of the device in the embodiment of the application is similar to the emotion recognition model training method in the embodiment of the application, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.
Fig. 5 is a block diagram illustrating an emotion recognition model training apparatus according to some embodiments of the present application, which implements functions corresponding to the steps performed by the above-described method. The apparatus may be understood as the server, or a processor of the server, or may be understood as a component that is independent from the server or the processor and implements the functions of the present application under the control of the server, as shown in the figure, the emotion recognition model training apparatus may include: a first acquisition module 401, a transformation module 402, a fusion module 403, and a training module 404, wherein,
a first obtaining module 401, configured to obtain user data, where the user data includes voice data;
a conversion module 402, configured to convert each piece of the voice data into text data;
a fusion module 403, configured to fuse each piece of user data and text data converted from voice data included in the user data into a piece of training feature, where the obtained training features corresponding to all user data form training data;
and the training module 404 is configured to input the training data into an initial network model for training, so as to obtain an emotion recognition model.
In some embodiments, the fusion module 403 is further configured to:
performing feature extraction on each piece of text data to obtain text features;
performing feature extraction on each piece of voice data to obtain voice features;
and fusing any one voice feature and the text feature corresponding to any one voice feature to obtain a training feature, and forming training data by the obtained training features corresponding to all user data.
In some embodiments, the fusion module 403 is further configured to:
mapping the text data into a hyperplane space to form text points in the hyperplane space;
encoding the text points with a first convolutional network to obtain a first intermediate feature;
and extracting the important features from the first intermediate feature through maximum value pooling to obtain the text feature.
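As a non-limiting sketch of this text branch, assuming a PyTorch implementation in which an embedding layer plays the role of the mapping into the hyperplane space; the vocabulary size, embedding width, and kernel size are illustrative assumptions rather than values taken from the embodiments.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Text points -> first convolutional network -> maximum pooling -> text feature."""
    def __init__(self, vocab_size=10000, embed_dim=128, feat_dim=64, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # map tokens into a vector space ("text points")
        self.conv = nn.Conv1d(embed_dim, feat_dim, kernel, padding=1)  # "first convolution network"

    def forward(self, token_ids):                    # token_ids: (batch, seq_len) integer tensor
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                 # first intermediate feature
        return x.max(dim=2).values                   # pool the important features over time -> text feature

# usage sketch
text_feature = TextBranch()(torch.randint(0, 10000, (4, 20)))   # -> shape (4, 64)
```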
In some embodiments, the fusion module 403 is further configured to:
processing the text points in the first convolutional network to obtain a first original feature;
processing the first original feature through an attention mechanism to obtain a first attention feature;
and weighting the first original feature and the first attention feature to obtain the first intermediate feature.
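This weighting step can be read, for example, as a fixed-weight combination of the original feature and its attention-reweighted copy; the following sketch assumes a simple softmax attention and a constant weighting coefficient, neither of which is prescribed by the embodiments.

```python
import torch
import torch.nn as nn

class WeightedAttention(nn.Module):
    """Weight an original feature together with its attention-reweighted copy."""
    def __init__(self, feat_dim, alpha=0.5):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)   # attention score per position
        self.alpha = alpha                    # assumed fixed weighting coefficient

    def forward(self, original):              # original feature: (batch, seq_len, feat_dim)
        weights = torch.softmax(self.score(original), dim=1)   # (batch, seq_len, 1)
        attention_feature = weights * original                 # attention feature
        return self.alpha * original + (1 - self.alpha) * attention_feature   # intermediate feature
```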
In some embodiments, the fusion module 403 is further configured to:
processing the voice data through a second convolutional network to obtain a second intermediate feature;
inputting the second intermediate feature into a long short-term memory network to identify the context dependencies in the second intermediate feature;
and extracting the important features in the second intermediate feature through maximum value pooling to obtain the voice feature.
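A non-limiting sketch of this speech branch, assuming that the voice data has already been turned into a (batch, time, filter-bank) tensor; the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """Second convolutional network -> LSTM (context dependency) -> maximum pooling -> voice feature."""
    def __init__(self, in_dim=40, conv_dim=64, lstm_dim=64):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, conv_dim, kernel_size=5, padding=2)  # "second convolution network"
        self.lstm = nn.LSTM(conv_dim, lstm_dim, batch_first=True)          # captures context dependencies

    def forward(self, frames):                               # frames: (batch, time, in_dim)
        x = torch.relu(self.conv(frames.transpose(1, 2)))    # second intermediate feature, (batch, conv_dim, time)
        x, _ = self.lstm(x.transpose(1, 2))                  # (batch, time, lstm_dim)
        return x.max(dim=1).values                           # pool the important features -> voice feature

# usage sketch
voice_feature = SpeechBranch()(torch.randn(4, 200, 40))      # -> shape (4, 64)
```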
In some embodiments, the fusion module 403 is further configured to:
processing the voice data in the second convolutional network to obtain a second original feature;
processing the second original feature through an attention mechanism to obtain a second attention feature;
and weighting the second original feature and the second attention feature to obtain the second intermediate feature.
In some embodiments, the fusion module 403 is further configured to:
and splicing each voice feature with its corresponding text feature to form a training feature.
In some embodiments, the fusion module 403 is further configured to:
padding each text feature and each voice feature with a set number of set values to obtain a supplemented text feature and a supplemented voice feature;
and taking the outer product of each supplemented text feature and its corresponding supplemented voice feature to obtain a training feature.
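The two fusion options described above, splicing and padding followed by an outer product, can be sketched as follows; the padded length and padding value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_by_splicing(voice_feat, text_feat):
    """Splice (concatenate) a voice feature with its corresponding text feature."""
    return torch.cat([voice_feat, text_feat], dim=-1)

def fuse_by_outer_product(voice_feat, text_feat, target_len=64, pad_value=0.0):
    """Pad both features to a set length with a set value, then take their outer product."""
    def pad(f):
        extra = target_len - f.shape[-1]
        return F.pad(f, (0, extra), value=pad_value) if extra > 0 else f
    v, t = pad(voice_feat), pad(text_feat)
    return torch.einsum('bi,bj->bij', t, v).flatten(1)   # (batch, target_len * target_len)

# usage sketch
v, t = torch.randn(4, 64), torch.randn(4, 48)
training_feature = fuse_by_outer_product(v, t)            # -> shape (4, 4096)
```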
In some embodiments, the user data further includes image data, and the fusion module 403 is further configured to:
and fusing each piece of image data, the voice data corresponding to that image data, and the text data converted from that voice data to form a training feature, where the training features obtained for all user data form the training data.
In some embodiments, the fusion module 403 is further configured to:
performing feature extraction on the image data to obtain an image feature;
and fusing each text feature with its corresponding voice feature and image feature to obtain a training feature, where the training features obtained for all user data form the training data.
In some embodiments, the fusion module 403 is further configured to:
processing the image data through a third convolutional network to obtain a third intermediate feature;
inputting the third intermediate feature into a long short-term memory network to identify the context dependencies in the third intermediate feature;
and extracting the important features in the third intermediate feature through maximum value pooling to obtain the image feature.
In some embodiments, the fusion module 403 is further configured to:
processing the image data in the third convolutional network to obtain a third original feature;
processing the third original feature through an attention mechanism to obtain a third attention feature;
and weighting the third original feature and the third attention feature to obtain the third intermediate feature.
In some embodiments, the training module 404 is further configured to:
inputting the training data into the latest training model for calculation to obtain an initial calculation result;
calculating the current error of the current model from the initial calculation result and the labeling result corresponding to the training data;
if the current error is larger than a set value, adjusting the parameters of the initial network model in a set calculation mode and updating the training model;
and if the current error is smaller than the set value, taking the training model corresponding to that error as the emotion recognition model.
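A minimal training-loop sketch consistent with these steps, assuming a PyTorch model, a cross-entropy loss as the error measure, and gradient descent as the set calculation mode; the error threshold and learning rate are placeholders.

```python
import torch
import torch.nn as nn

def train_emotion_model(model, training_features, labels,
                        error_threshold=0.05, lr=1e-3, max_steps=10000):
    """Iterate until the current error falls below the set value, then return the trained model."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)   # "set calculation mode": gradient descent
    criterion = nn.CrossEntropyLoss()
    for _ in range(max_steps):
        logits = model(training_features)        # initial calculation result
        error = criterion(logits, labels)        # compare with the labeling result -> current error
        if error.item() < error_threshold:       # small enough: use this model as the emotion recognition model
            break
        optimizer.zero_grad()
        error.backward()                         # adjust the parameters and update the training model
        optimizer.step()
    return model
```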
In some embodiments, the training module 404 is further configured to:
calculating the current error of the current model from the initial calculation result and the labeling result corresponding to the training data with a penalty term applied; or,
inputting the training data into the latest training model for calculation with a penalty term applied, to obtain the initial calculation result.
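One common reading of the penalty term is an L2 regularization term added to the error; the following sketch is written under that assumption and is not the only possible interpretation.

```python
import torch

def error_with_penalty(criterion, logits, labels, model, weight_decay=1e-4):
    """Current error = task loss against the labeling result + an L2 penalty over the parameters."""
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    return criterion(logits, labels) + weight_decay * penalty
```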
The modules may be connected or communicate with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or any combination thereof. The wireless connection may include a connection via a LAN, a WAN, Bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Example five
Based on the same inventive concept, an embodiment of the present application further provides an emotion recognition device corresponding to the emotion recognition method. Since the device solves the problem on a principle similar to that of the emotion recognition method described above, its implementation may refer to the implementation of the method, and repeated descriptions are omitted.
Fig. 6 is a block diagram of an emotion recognition device according to some embodiments of the present application; the device implements functions corresponding to the steps performed by the method described above. The device may be understood as the server, as a processor of the server, or as a component that is independent of the server or the processor and implements the functions of the application under the control of the server. As shown in Fig. 6, the emotion recognition device may include: a second acquisition module 501 and a recognition module 502, wherein:
the second acquisition module 501 is configured to acquire current user data of a target user;
the recognition module 502 is configured to input the current user data into the emotion recognition model for recognition, so as to obtain the current state of the target user.
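For illustration, the recognition step can be as simple as a forward pass of the fused current-user feature through the trained model; the label index used for the unsafe state here is an assumption.

```python
import torch

def recognize_current_state(model, current_feature, unsafe_label=1):
    """Run the trained emotion recognition model on the fused feature of the current user data."""
    model.eval()
    with torch.no_grad():
        logits = model(current_feature.unsqueeze(0))   # add a batch dimension
    state = logits.argmax(dim=-1).item()               # current state of the target user
    return state, state == unsafe_label                # the flag can drive the prompting module below
```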
In some embodiments, the emotion recognition apparatus further includes:
the prompting module 503 is configured to generate a prompt message if the current state indicates that the target user is in an unsafe state, and to send the prompt message to the target user's terminal or an associated platform.
The modules may be connected or communicate with each other via a wired or wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or any combination thereof. The wireless connection may include a connection via a LAN, a WAN, Bluetooth, ZigBee, NFC, or the like, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
In addition, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it performs the steps of the emotion recognition model training method described in the above method embodiments.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it performs the steps of the emotion recognition method described in the above method embodiments.
The computer program product of the emotion recognition model training method provided in the embodiments of the present application includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the steps of the emotion recognition model training method described in the above method embodiments, to which reference may be made for details, and which are not repeated here.
The computer program product of the emotion recognition method provided in the embodiments of the present application likewise includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the steps of the emotion recognition method described in the above method embodiments, to which reference may be made for details, and which are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the method embodiments and are not described in detail here. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the modules is merely a logical division, and other divisions are possible in actual implementation. For example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be implemented through communication interfaces, and the indirect couplings or communication connections between devices or modules may be electrical, mechanical, or in other forms.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the portion thereof that substantially contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, and the protection scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (30)

1. A method for training an emotion recognition model, comprising:
acquiring user data, wherein the user data comprises voice data;
converting each piece of voice data into text data;
fusing each piece of user data and text data converted from voice data contained in the user data into a piece of training feature, and forming training data by using the obtained training features corresponding to all the user data;
inputting the training data into an initial network model for training to obtain an emotion recognition model;
the step of fusing each piece of user data and text data converted from voice data included in the user data into a piece of training feature, and forming training data from the training features corresponding to all the obtained user data includes: performing feature extraction on each piece of text data to obtain text features; performing feature extraction on each piece of voice data to obtain voice features; any one voice feature and the text feature corresponding to any one voice feature are fused to obtain a training feature, and the obtained training features corresponding to all user data form training data;
the step of extracting the characteristics of each piece of voice data to obtain voice characteristics includes: processing the voice data through a second convolution network to obtain a second intermediate characteristic; inputting the second intermediate features into a long-short term memory model network, and identifying the context dependency of the second intermediate features; and extracting important features in the second intermediate features through maximum value pooling processing to obtain the voice features.
2. The method of claim 1, wherein the step of extracting the feature of each text data to obtain the text feature comprises:
mapping the text data to a hyperplane space to form a text point in the hyperplane space;
coding the text points on a first convolution network to obtain a first intermediate characteristic;
and extracting important features from the first intermediate features through maximum value pooling to obtain text features.
3. The method of claim 2, wherein said step of encoding said text point on a first convolutional network to obtain a first intermediate feature comprises:
processing the text points in a first convolution network to obtain a first original characteristic;
processing the first original feature through an attention mechanism to obtain a first attention feature;
and carrying out weighting processing on the first original feature and the first attention feature to obtain a first intermediate feature.
4. The method of claim 1, wherein the step of processing the speech data through a second convolutional network to obtain a second intermediate feature comprises:
processing the voice data in a second convolution network to obtain a second original characteristic;
processing the second original characteristic through an attention mechanism to obtain a second attention characteristic;
and performing weighting processing on the second original feature and the second attention feature to obtain a second intermediate feature.
5. The method of claim 1, wherein the step of fusing any one of the speech features with the text feature corresponding to the any one of the speech features to obtain a training feature comprises:
and splicing any one voice feature and the text feature corresponding to any one voice feature to form a training feature.
6. The method of claim 5, wherein the step of concatenating any speech feature with the text feature corresponding to any speech feature to form a training feature comprises:
supplementing a set value of a set quantity with each text feature and each voice feature to obtain a supplemented text feature and a supplemented voice feature;
and performing outer product on any supplementary text feature and the voice feature corresponding to any supplementary text feature to obtain a training feature.
7. The method according to claim 1, wherein the user data further includes image data, and the step of fusing each piece of the user data and text data converted from speech data included in the user data into a piece of training feature, and forming training data by using training features corresponding to all pieces of the user data includes:
and fusing any image data, voice data corresponding to any image data and text data converted from the voice data to form a training feature, and forming training data by using the training features corresponding to all the obtained user data.
8. The method according to claim 7, wherein the step of fusing each piece of the user data and text data converted from speech data included in the user data into a piece of training feature, and forming the training data by the obtained training features corresponding to all pieces of the user data, includes:
performing feature extraction on the image data to obtain image features;
and fusing any text feature with the voice feature and the image feature corresponding to the text feature to obtain a training feature, and forming training data by the training features corresponding to all the obtained user data.
9. The method of claim 8, wherein the step of extracting features from the image data to obtain image features comprises:
processing the image data through a third convolution network to obtain a third intermediate characteristic;
inputting the third intermediate features into a long-short term memory model network, identifying context dependencies in the third intermediate features;
and extracting important features in the third intermediate features through maximum value pooling processing to obtain image features.
10. The method of claim 9, wherein the step of processing the image data through a third convolutional network to obtain a third intermediate feature comprises:
processing the image data in a third convolution network to obtain a third original characteristic;
processing the third original characteristic through an attention mechanism to obtain a third attention characteristic;
and carrying out weighting processing on the third original characteristic and the third attention characteristic to obtain a third intermediate characteristic.
11. The method of claim 1, wherein the step of inputting the training data into an initial network model for training to obtain an emotion recognition model comprises:
inputting the training data into a latest training model for calculation to obtain an initial calculation result;
calculating the initial calculation result and a labeling result corresponding to the training data to obtain a current error of the current model;
if the current error is larger than a set value, adjusting parameters in the initial network model in a set calculation mode, and updating the training model;
and if the current error is smaller than the set value, taking the training model corresponding to the current error smaller than the set value as the emotion recognition model.
12. The method of claim 11, wherein the step of calculating the initial calculation result and the labeling result corresponding to the training data to obtain the current error of the current model comprises: calculating the current error of the current model from the initial calculation result and the labeling result corresponding to the training data with a penalty term applied; or,
the step of inputting the training data into the latest training model for calculation to obtain an initial calculation result comprises: inputting the training data into the latest training model for calculation with a penalty term applied, to obtain the initial calculation result.
13. A method of emotion recognition, comprising:
acquiring current user data of a target user;
inputting the current user data into the emotion recognition model of any one of claims 1-12 for recognition, and obtaining the current state of the target user.
14. The method of claim 13, wherein the method further comprises:
and if the current state represents that the target user is in an unsafe state, generating a prompt message, and sending the prompt message to a target user terminal or a related platform.
15. An emotion recognition model training apparatus, comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring user data, and the user data comprises voice data;
the conversion module is used for converting each piece of voice data into text data;
the fusion module is used for fusing each piece of user data and text data converted from voice data contained in the user data into a piece of training feature, and the obtained training features corresponding to all the user data form training data;
the training module is used for inputting the training data into an initial network model for training to obtain an emotion recognition model;
wherein the fusion module is further configured to: performing feature extraction on each piece of text data to obtain text features; performing feature extraction on each piece of voice data to obtain voice features; any one voice feature and the text feature corresponding to any one voice feature are fused to obtain a training feature, and the obtained training features corresponding to all user data form training data;
the fusion module is further used for processing the voice data through a second convolution network to obtain a second intermediate characteristic; inputting the second intermediate features into a long-short term memory model network, and identifying the context dependency of the second intermediate features; and extracting important features in the second intermediate features through maximum pooling to obtain the voice features.
16. The apparatus of claim 15, wherein the fusion module is further configured to:
mapping the text data to a hyperplane space to form a text point in the hyperplane space;
coding the text point on a first convolution network to obtain a first intermediate characteristic;
and extracting important features from the first intermediate features through maximum value pooling to obtain text features.
17. The apparatus of claim 16, wherein the fusion module is further configured to:
processing the text points in a first convolution network to obtain first original characteristics;
processing the first original feature through an attention mechanism to obtain a first attention feature;
and carrying out weighting processing on the first original feature and the first attention feature to obtain a first intermediate feature.
18. The apparatus of claim 17, wherein the fusion module is further configured to:
processing the voice data in a second convolution network to obtain a second original characteristic;
processing the second original characteristic through an attention mechanism to obtain a second attention characteristic;
and performing weighting processing on the second original feature and the second attention feature to obtain a second intermediate feature.
19. The apparatus of claim 15, wherein the fusion module is further configured to:
and splicing any one voice feature and the text feature corresponding to any one voice feature to form a training feature.
20. The apparatus of claim 19, wherein the fusion module is further configured to:
supplementing a set value of a set quantity with each text feature and each voice feature to obtain a supplemented text feature and a supplemented voice feature;
and performing outer product on any supplementary text feature and the voice feature corresponding to any supplementary text feature to obtain a training feature.
21. The apparatus of claim 15, wherein the user data further comprises image data, the fusion module further to:
and fusing any image data, voice data corresponding to any image data and text data converted from the voice data to form a training feature, and forming training data by using the training features corresponding to all the obtained user data.
22. The apparatus of claim 21, wherein the fusion module is further configured to:
performing feature extraction on the image data to obtain image features;
and fusing any text feature with the voice feature and the image feature corresponding to the text feature to obtain a training feature, and forming training data by the obtained training features corresponding to all user data.
23. The apparatus of claim 22, wherein the fusion module is further configured to:
processing the image data through a third convolution network to obtain a third intermediate characteristic;
inputting the third intermediate features into a long-short term memory model network, and identifying the dependency relationship of the context in the third intermediate features;
and extracting important features in the third intermediate features through maximum value pooling processing to obtain image features.
24. The apparatus of claim 23, wherein the fusion module is further configured to:
processing the image data in a third convolution network to obtain a third original characteristic;
processing the third original characteristic through an attention mechanism to obtain a third attention characteristic;
and performing weighting processing on the third original feature and the third attention feature to obtain a third intermediate feature.
25. The apparatus of claim 15, wherein the training module is further configured to:
inputting the training data into a latest training model for calculation to obtain an initial calculation result;
calculating the initial calculation result and a labeling result corresponding to the training data to obtain a current error of the current model;
if the current error is larger than a set value, adjusting parameters in the initial network model in a set calculation mode, and updating the training model;
and if the current error is smaller than the set value, taking the training model corresponding to the current error smaller than the set value as the emotion recognition model.
26. The apparatus of claim 25, wherein the training module is further configured to:
calculating the current error of the current model from the initial calculation result and the labeling result corresponding to the training data with a penalty term applied; or,
and inputting the training data into the latest training model for calculation with a penalty term applied, to obtain the initial calculation result.
27. An emotion recognition apparatus, comprising:
the second acquisition module is used for acquiring the current user data of the target user;
an identification module, configured to input the current user data into the emotion recognition model according to any one of claims 1 to 12 for identification, so as to obtain a current state of the target user.
28. The apparatus of claim 27, wherein the apparatus further comprises:
and the prompting module is used for generating a prompting message if the current state represents that the target user is in an unsafe state, and sending the prompting message to a target user terminal or a related platform.
29. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions, when executed by the processor, performing the steps of the method of any one of claims 1 to 14.
30. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 14.
CN201910141010.XA 2019-02-25 2019-02-25 Emotion recognition model training method, emotion recognition device and electronic equipment Active CN111681645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910141010.XA CN111681645B (en) 2019-02-25 2019-02-25 Emotion recognition model training method, emotion recognition device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111681645A CN111681645A (en) 2020-09-18
CN111681645B true CN111681645B (en) 2023-03-31

Family

ID=72451175

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910141010.XA Active CN111681645B (en) 2019-02-25 2019-02-25 Emotion recognition model training method, emotion recognition device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111681645B (en)





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant