CN110909131A - Model generation method, emotion recognition method, system, device and storage medium - Google Patents


Info

Publication number
CN110909131A
CN110909131A
Authority
CN
China
Prior art keywords
emotion recognition
vector
recognition model
word
spectrogram
Prior art date: 2019-11-26
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911172477.7A
Other languages
Chinese (zh)
Inventor
邓艳江
罗超
胡泓
成丹妮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2019-11-26
Filing date: 2019-11-26
Publication date: 2020-03-24
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201911172477.7A priority Critical patent/CN110909131A/en
Publication of CN110909131A publication Critical patent/CN110909131A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/332 - Query formulation
    • G06F 16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The invention discloses a model generation method, an emotion recognition method, a system, a device and a storage medium. The emotion recognition model generation method comprises the following steps: acquiring a plurality of historical speech signals; processing each historical speech signal to obtain a spectrogram; converting each historical speech signal into text information, and processing the text information to generate word vectors; and performing model training based on a deep learning network, with the plurality of spectrograms and the plurality of word vectors as sample data, to obtain a multi-modal emotion recognition model. Using a multi-modal approach, the invention establishes a multi-modal emotion recognition model on two kinds of input, the speech itself and the text information obtained from it by speech recognition, so that the guest's emotion in a voice call can be monitored in real time. This addresses the problems caused by analyzing emotion from acoustic features alone, which discards semantic information: low accuracy in recognizing negative emotions, a large gap between the guest's impression of the robot and that of a real person, and poor user experience.

Description

Model generation method, emotion recognition method, system, device and storage medium
Technical Field
The invention belongs to the artificial intelligence technology, and particularly relates to a model generation method, an emotion recognition method, a system, equipment and a storage medium.
Background
With the development of artificial intelligence technology, many repetitive tasks are now completed by machines, and the customer service robot is one example. In voice interactions between a customer service robot and guests, the robot is competent for work with a clearly defined flow, but some guests will show negative emotions such as impatience, anger or anxiety. If a guest is in such a state, the service provided by the robot may provoke even stronger dissatisfaction. Therefore, for the customer service robot to serve guests well, the ability to identify the guest's emotion is necessary. That is to say, during the service process of the customer service robot, the guest's emotion needs to be monitored in real time, and if the emotion is abnormal, an early warning should be issued immediately and the call switched to a human agent.
At present, the industry mostly applies emotion recognition models based on the customer's speech to monitor the customer's emotion in real time; when a negative emotion appears, an early warning is given and the conversation is switched to human service to guarantee service quality. However, recognizing negative emotions with an emotion recognition model based only on the customer's speech yields low recognition accuracy, the guest's impression of the robot differs greatly from that of a real person, and the user experience is poor.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, in which applying an emotion recognition model based only on the guest's speech to recognize negative emotions leads to low recognition accuracy, a large gap between the guest's impression of the robot and that of a real person, and consequently poor user experience.
The invention solves the technical problems through the following technical scheme:
the invention provides a method for generating an emotion recognition model in a first aspect, which comprises the following steps:
acquiring a plurality of historical voice signals;
processing each historical voice signal to obtain a spectrogram;
converting each historical voice signal into text information, and processing the text information to generate word vectors;
and performing model training on the basis of a deep learning network by taking the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model.
Preferably, the deep learning network comprises a first multi-layer unidirectional LSTM (Long Short-Term Memory) network and a second multi-layer unidirectional LSTM network;
the step of performing model training based on a deep learning network by using the plurality of spectrogram and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model comprises the following steps:
using a plurality of spectrogram patterns as input of the first multilayer unidirectional LSTM network;
using a number of the word vectors as input to the second multi-layer unidirectional LSTM network.
Preferably, the step of performing model training based on a deep learning network by using the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model includes:
screening and aggregating hidden vectors output by the first multi-layer unidirectional LSTM network by utilizing an Attention mechanism to obtain a first context vector;
screening and aggregating the hidden vectors output by the second multilayer unidirectional LSTM network by using an Attention mechanism to obtain a second context vector;
after performing voting processing on the first context vector and the second context vector, classifying by using a SoftMax function.
Preferably, the voting process for the first context vector and the second context vector comprises:
adding the first context vector and the second context vector, or splicing the first context vector and the second context vector into one vector.
Preferably, the step of processing each of the historical speech signals to obtain a spectrogram comprises:
preprocessing each historical voice signal to obtain the frequency spectrum of each frame;
combining the frequency spectrums of the frames into the spectrogram along a time sequence;
the preprocessing comprises pre-emphasis, framing, windowing and fast Fourier transform;
and/or,
the step of processing the text information to generate a word vector comprises:
performing word segmentation on the text information;
mapping the words obtained after word segmentation into the word vectors by utilizing a trained word vector library;
the word vector library is obtained by utilizing a plurality of user comment texts and training the comment texts by using an unsupervised learning algorithm.
The second aspect of the present invention provides an emotion recognition method, including:
processing the voice signal to be recognized to obtain a spectrogram of the voice to be recognized;
converting the voice signal to be recognized into text information to be recognized, and processing the text information to be recognized to generate a word vector to be recognized;
inputting the spectrogram to be recognized and the word vector to be recognized into a multi-modal emotion recognition model for classification to obtain an emotion recognition result;
the multi-modal emotion recognition model is generated using the method for generating an emotion recognition model according to the first aspect.
A third aspect of the present invention provides a system for generating an emotion recognition model, including:
the acquisition module is used for acquiring a plurality of historical voice signals;
the first processing module is used for processing each historical voice signal to obtain a spectrogram;
the second processing module is used for converting each historical voice signal into text information;
the third processing module is used for processing the text information to generate a word vector;
and the model training module is used for performing model training on the basis of a deep learning network by taking the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model.
Preferably, the deep learning network comprises a first multi-layer unidirectional LSTM network and a second multi-layer unidirectional LSTM network;
and the model training module takes the plurality of spectrograms as the input of the first multi-layer unidirectional LSTM network and the plurality of word vectors as the input of the second multi-layer unidirectional LSTM network.
Preferably, the model training module screens and aggregates hidden vectors output by the first multi-layer unidirectional LSTM network by using an Attention mechanism to obtain a first context vector; the model training module also screens and aggregates hidden vectors output by the second multilayer unidirectional LSTM network by using an Attention mechanism to obtain a second context vector;
the model training module is further used for conducting voting processing on the first context vector and the second context vector and then conducting classification through a SoftMax function.
Preferably, the voting process on the first context vector and the second context vector comprises adding the first context vector and the second context vector; or, alternatively, it comprises splicing the first context vector and the second context vector into one vector.
Preferably, the first processing module comprises a preprocessing unit and a splicing unit;
the preprocessing unit is used for preprocessing each historical voice signal to obtain the frequency spectrum of each frame;
the splicing unit is used for splicing the frequency spectrums of the frames into the spectrogram along a time sequence;
the preprocessing comprises pre-emphasis, framing, windowing and fast Fourier transform;
and/or,
the second processing module comprises a word segmentation unit and a mapping unit;
the word segmentation unit is used for segmenting the text information;
the mapping unit is used for mapping the words obtained after word segmentation into the word vectors by utilizing a trained word vector library;
the word vector library is obtained by utilizing comment texts of a plurality of users and training the comment texts by using an unsupervised learning algorithm.
A fourth aspect of the present invention provides an emotion recognition system, including:
the fourth processing module is used for processing the voice signal to be recognized to obtain a spectrogram to be recognized;
the fifth processing module is used for converting the voice signal to be recognized into text information to be recognized and processing the text information to be recognized to generate a word vector to be recognized;
the recognition module is used for inputting the spectrogram to be recognized and the word vector to be recognized into a multi-modal emotion recognition model for classification so as to obtain an emotion recognition result;
the multi-modal emotion recognition model is generated using the emotion recognition model generation system according to the third aspect.
A fifth aspect of the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method for generating an emotion recognition model according to the first aspect or the method for emotion recognition according to the second aspect when executing the computer program.
A sixth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method for generating an emotion recognition model according to the first aspect or the emotion recognition method according to the second aspect.
The positive effects of the invention are as follows:
The invention provides a model generation method, an emotion recognition method, a system, a device and a storage medium, and, using a multi-modal approach, establishes a multi-modal emotion recognition model on two kinds of input: the speech itself and the text information obtained from it by speech recognition. The model is a classification model; using a deep learning framework, speech and text features are extracted separately and then combined for classification, so as to judge the guest's emotion. The invention can monitor the guest's emotion in a voice call in real time, and solves the problems caused by analyzing emotion from acoustic features alone, which discards semantic information: low accuracy in recognizing negative emotions, a large gap between the guest's impression of the robot and that of a real person, and poor user experience.
Drawings
Fig. 1 is a flowchart of a method for generating an emotion recognition model according to embodiment 1 of the present invention.
Fig. 2 is a flowchart of step S4 in fig. 1.
Fig. 3 is a flowchart of step S1 in fig. 1.
Fig. 4 is a flowchart of step S3 in fig. 1.
Fig. 5 is a flowchart of an emotion recognition method in embodiment 2 of the present invention.
Fig. 6 is a schematic diagram of a process of performing emotion recognition on a speech signal to be recognized by using the emotion recognition method provided in embodiment 2.
Fig. 7 is a schematic block diagram of a system for generating an emotion recognition model according to embodiment 3 of the present invention.
Fig. 8 is a schematic block diagram of an emotion recognition system according to embodiment 4 of the present invention.
Fig. 9 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Embodiment 1
As shown in fig. 1, the present embodiment provides a method for generating an emotion recognition model, including:
step S1, acquiring a plurality of historical voice signals.
And step S2, processing each historical voice signal to obtain a spectrogram.
Step S3, converting each historical speech signal into text information, and processing the text information to generate a word vector.
And step S4, performing model training by taking the plurality of spectrograms and the plurality of word vectors as sample data based on the deep learning network to obtain a multi-modal emotion recognition model.
The deep learning network comprises a first multi-layer unidirectional LSTM network and a second multi-layer unidirectional LSTM network. In step S4, the spectrograms are used as the input of the first multi-layer unidirectional LSTM network, and the word vectors are used as the input of the second multi-layer unidirectional LSTM network. In this embodiment, both networks are implemented as two-layer unidirectional LSTM networks.
There is no hard requirement on the amount of sample data. In this embodiment, 21,500 historical speech signals with an average duration of about 5 s were collected, of which 1,500 express extreme emotion and 20,000 express non-extreme emotion. Each historical speech signal is labeled manually; the label takes one of two values, one representing extreme emotion and the other representing non-extreme emotion.
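By way of illustration only, the labeled sample set described above might be organized as follows (a minimal Python sketch; the function and variable names are hypothetical and not part of the patent):

```python
import random

def build_sample_list(extreme_paths, non_extreme_paths, seed=42):
    """Pair each recording with a binary label: 1 = extreme emotion, 0 = non-extreme emotion."""
    samples = [(path, 1) for path in extreme_paths] + \
              [(path, 0) for path in non_extreme_paths]
    random.Random(seed).shuffle(samples)
    return samples

# In this embodiment roughly 1,500 extreme and 20,000 non-extreme recordings
# (about 5 s each) would be passed in:
# samples = build_sample_list(extreme_paths, non_extreme_paths)
```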
As shown in fig. 2, step S4 in this embodiment includes the following steps:
step S401, an Attention mechanism is utilized to screen and aggregate hidden vectors output by the first multi-layer unidirectional LSTM network, so as to obtain a first context vector.
Step S401', an Attention mechanism is utilized to screen and aggregate the hidden vectors output by the second multi-layer unidirectional LSTM network, so as to obtain a second context vector.
The specific implementation manner of the Attention mechanism is as follows:
u_t = tanh(W_w h_t + b_w)
α_t = exp(u_t^T u_w) / Σ_t exp(u_t^T u_w)
s = Σ_t α_t h_t
where h_t denotes the hidden vector output by the LSTM network at time t, W_w is a parameter matrix, b_w is a bias, u_w is a parameter vector, α_t denotes the normalized weight at time t, and s is the context vector computed by the Attention mechanism.
In this embodiment, the vectors subjected to the preliminary feature extraction by the first multi-layer unidirectional LSTM network and the second multi-layer unidirectional LSTM network are subjected to feature screening and aggregation by an Attention mechanism, so as to obtain corresponding context vectors respectively. The execution sequence of step S401 and step S401' is not limited, and may be performed simultaneously or sequentially.
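For illustration, the Attention pooling described by the formulas above can be sketched in a few lines of NumPy (the dimensions and variable names are assumptions, not values specified by the patent):

```python
import numpy as np

def attention_pool(h, W_w, b_w, u_w):
    """Screen and aggregate LSTM hidden vectors into one context vector.

    h   : array of shape (T, d), the hidden vectors h_t at each time step
    W_w : (d, d) parameter matrix, b_w : (d,) bias, u_w : (d,) parameter vector
    """
    u = np.tanh(h @ W_w + b_w)              # u_t = tanh(W_w h_t + b_w)
    scores = u @ u_w                        # u_t . u_w for every time step
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # normalized weights alpha_t
    return alpha @ h                        # s = sum_t alpha_t * h_t
```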
Step S402: after voting processing is performed on the first context vector and the second context vector, classification is performed with a SoftMax function. The voting processing can add the first context vector and the second context vector, or splice them into one vector, so as to combine the two types of features. For example: if the first context vector is [1, 2, 3] and the second context vector is [4, 5, 6, 7], the spliced vector is [1, 2, 3, 4, 5, 6, 7]. The former implementation (addition) is adopted in the present embodiment.
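A small NumPy illustration of the two voting options (splicing works for vectors of different lengths, as in the example above; addition assumes the two context vectors have the same dimension, for instance after a fully connected projection):

```python
import numpy as np

c1 = np.array([1.0, 2.0, 3.0])         # first context vector (speech branch)
c2 = np.array([4.0, 5.0, 6.0, 7.0])    # second context vector (text branch)

spliced = np.concatenate([c1, c2])     # -> [1. 2. 3. 4. 5. 6. 7.]

# Addition, the option used in this embodiment, requires equal dimensions,
# e.g. after both branches are projected to the same size by a Dense layer.
c2_projected = np.array([4.0, 5.0, 6.0])   # hypothetical projected vector
added = c1 + c2_projected                  # -> [5. 7. 9.]
```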
As shown in fig. 3, step S1 in this embodiment includes the following steps:
step S101, preprocessing each historical voice signal to obtain the frequency spectrum of each frame. Wherein the pre-processing includes pre-emphasis, framing, windowing, and fast fourier transform.
And step S102, combining the frequency spectrums of each frame into a spectrogram along a time sequence.
In this embodiment, the voice data, namely the historical voice signals, are numerical sequences obtained by sampling with a sensor. These sequences carry time-domain information, and the time domain does not readily expose the fundamental characteristics of sound, so preprocessing is performed with signal-processing techniques.
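As a rough sketch of this preprocessing chain (the sample rate, frame length and FFT size below are assumptions for telephone-quality audio, not values specified by the patent):

```python
import numpy as np

def make_spectrogram(signal, sample_rate=8000, frame_len=0.025,
                     frame_step=0.010, alpha=0.97, n_fft=512):
    """Pre-emphasis, framing, windowing and FFT, then stack the per-frame
    magnitude spectra in time order to form the spectrogram."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_size = int(round(frame_len * sample_rate))
    step = int(round(frame_step * sample_rate))
    # assumes the signal is at least one frame long
    num_frames = 1 + (len(emphasized) - frame_size) // step
    frames = np.stack([emphasized[i * step:i * step + frame_size]
                       for i in range(num_frames)])
    frames = frames * np.hamming(frame_size)          # windowing
    spectra = np.abs(np.fft.rfft(frames, n_fft))      # fast Fourier transform
    return spectra                         # shape (num_frames, n_fft // 2 + 1)
```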
As shown in fig. 4, the step of processing the text information to generate the word vector in step S3 in the present embodiment includes:
s301, performing word segmentation on the text information;
and step S302, mapping the words obtained after word segmentation into word vectors by using a trained word vector library.
The word vector library is obtained by utilizing a plurality of user comment texts and training the comment texts by using an unsupervised learning algorithm.
In this embodiment, to process the text data, i.e. the text information, a word vector library trained in advance is used to map each word into a multi-dimensional vector that carries semantic information. The user comment texts used for training are chosen according to the specific application scenario of emotion recognition. For example, for voice calls between a customer service robot and guests on an OTA (Online Travel Agency) website, user comment texts collected for hotels, travel products and the like are used to train the word vector library in an unsupervised learning manner.
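A minimal sketch of this text branch, assuming jieba for Chinese word segmentation and a gensim word-vector model trained on the comment texts (the patent does not name specific tools; the file path and dimension below are hypothetical):

```python
import jieba                              # assumed Chinese word-segmentation tool
import numpy as np
from gensim.models import KeyedVectors    # assumed word-vector format

# word-vector library trained offline on user comment texts (hotels, travel
# products, ...) with an unsupervised algorithm such as word2vec;
# the file name below is hypothetical
word_vectors = KeyedVectors.load("comment_word2vec.kv")

def text_to_word_vectors(text, dim=200):
    """Segment the text and map each word to its embedding; unknown words map to zeros."""
    words = jieba.lcut(text)
    vecs = [word_vectors[w] if w in word_vectors else np.zeros(dim)
            for w in words]
    return np.stack(vecs) if vecs else np.zeros((1, dim))
```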
This embodiment uses the multi-modal idea to establish a multi-modal emotion recognition model on two kinds of input: the speech itself and the text information obtained from it by speech recognition. The model is a classification model; using a deep learning framework, speech and text features are extracted separately and then combined for classification, so as to identify the guest's emotion. Using the output of the multi-modal emotion recognition model, the customer service robot can detect the guest's emotional state in real time and, if the state is abnormal, take further action such as issuing an early warning or switching to human service. In addition, the output of the model can be used by the other dialogue logic components of the customer service robot, so that the guest's impression of the robot is closer to interacting with a real person.
Embodiment 2
As shown in fig. 5, the present embodiment provides an emotion recognition method, including the steps of:
and T1, processing the voice signal to be recognized to obtain a spectrogram to be recognized.
And step T2, converting the voice signal to be recognized into text information to be recognized, and processing the text information to be recognized to generate a word vector to be recognized.
And step T3, inputting the spectrogram to be recognized and the word vector to be recognized into the multi-modal emotion recognition model for classification so as to obtain an emotion recognition result. Wherein the multi-modal emotion recognition model is generated using the method of generating an emotion recognition model described in embodiment 1.
To further illustrate the technical solution and the technical effect of the present embodiment, the following description is made with reference to fig. 6:
as shown in fig. 6, during the voice communication with the client, the service robot receives the to-be-recognized voice signal 1 of "do not cooperate with other people and call your" from the client. The emotion recognition method provided by the embodiment is adopted for processing as follows: firstly, a spectrogram 3 to be recognized is obtained by processing in a step T1, the processing mode comprises FFT (fast Fourier transform), then the speech signal 1 to be recognized is converted into text information to be recognized in a step T2, the text information to be recognized is processed to generate a word vector to be recognized, finally, the spectrogram 3 to be recognized and the word vector to be recognized are input into a multi-modal emotion recognition model 6 for classification through a step T3 to obtain an emotion recognition result, the output result is a two-dimensional vector, such as [0.2, 0.8], wherein 0.8 represents the probability of extreme emotion, 0.2 represents the probability of non-extreme emotion, and a character result of emotion recognition, namely 'extreme emotion' or 'non-extreme emotion', is obtained after the translation processing in a step Label. In this example, the probability of extreme emotion is high, and therefore, the final character result after Label translation is "extreme emotion". As shown in fig. 6, the multimodal emotion recognition model 6 includes a first two-layer unidirectional LSTM network 4 and a second two-layer unidirectional LSTM network 5, and after a vector output by an Attention mechanism enters into a sense (full connection layer), a Relu (name of an activation function) activates a function to process the vector, and then Add (Add) the processed vector to obtain a feature vector combining two types of speech and text, and finally classify the feature vector by using a SoftMax function to obtain an emotion recognition result.
With the emotion recognition method provided by this embodiment and the multi-modal emotion recognition model generated in embodiment 1, the customer service robot can detect the guest's emotional state in real time and recognize negative emotions. Extreme emotions can then be handled further, for example by issuing an early warning and switching to human service. This embodiment improves both the accuracy of negative-emotion recognition and the user experience, and makes the guest's impression of the customer service robot closer to interacting with a real person.
Embodiment 3
As shown in fig. 7, the present embodiment provides a generation system of an emotion recognition model, which includes an acquisition module 1, a first processing module 2, a second processing module 3, a third processing module 4, and a model training module 5.
The obtaining module 1 is used for obtaining a plurality of historical voice signals. The first processing module 2 is configured to process each historical speech signal to obtain a spectrogram. The second processing module 3 is used for converting each historical speech signal into text information. The third processing module 4 is configured to process the text information to generate a word vector.
The model training module 5 is used for performing model training on the basis of the deep learning network by taking the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model. The deep learning network comprises a first multi-layer unidirectional LSTM network and a second multi-layer unidirectional LSTM network. In the model training module 5, the plurality of spectrograms are used as the input of the first multi-layer unidirectional LSTM network, and the plurality of word vectors as the input of the second multi-layer unidirectional LSTM network.
In this embodiment, the model training module 5 utilizes an Attention mechanism to perform screening and aggregation on hidden vectors output by the first multi-layer unidirectional LSTM network, so as to obtain a first context vector. The model training module 5 also screens and aggregates hidden vectors output by the second multi-layer unidirectional LSTM network by using an Attention mechanism to obtain a second context vector. The model training module 5 is further configured to perform voting on the first context vector and the second context vector and then classify the first context vector and the second context vector by using a SoftMax function. The voting process of the first context vector and the second context vector refers to adding the first context vector and the second context vector or combining the first context vector and the second context vector into one vector.
In this embodiment, the first processing module 2 includes a preprocessing unit and a splicing unit. The preprocessing unit is used for preprocessing each historical voice signal to obtain the frequency spectrum of each frame; the pre-processing includes pre-emphasis, framing, windowing, and fast fourier transform. The splicing unit is used for splicing the frequency spectrums of each frame into a spectrogram along a time sequence.
In this embodiment, the second processing module 3 includes a word segmentation unit and a mapping unit. The word segmentation unit is used for segmenting the text information. The mapping unit is used for mapping the words obtained after word segmentation into word vectors by utilizing a trained word vector library; the word vector library is obtained by using the comment texts of a plurality of users and training the comment texts by using an unsupervised learning algorithm.
In this embodiment, to process the text data, i.e. the text information, a word vector library trained in advance is used to map each word into a multi-dimensional vector that carries semantic information. The user comment texts used for training are chosen according to the specific application scenario of emotion recognition; for example, for voice calls between a customer service robot and guests on an OTA website, user comment texts for hotels, travel products and the like are collected and the word vector library is trained in an unsupervised learning manner.
This embodiment uses the multi-modal idea to establish a multi-modal emotion recognition model on two kinds of input: the speech itself and the text information obtained from it by speech recognition. The model is a classification model; using a deep learning framework, speech and text features are extracted separately and then combined for classification, so as to identify the guest's emotion. Using the output of the multi-modal emotion recognition model, the customer service robot can detect the guest's emotional state in real time and, if the state is abnormal, take further action such as issuing an early warning or switching to human service. In addition, the output of the model can be used by the other dialogue logic components of the customer service robot, so that the guest's impression of the robot is closer to interacting with a real person.
Embodiment 4
As shown in fig. 8, the present embodiment provides an emotion recognition system including a fourth processing module 6, a fifth processing module 7, and a recognition module 8. The fourth processing module 6 is configured to process the voice signal to be recognized to obtain a spectrogram to be recognized. The fifth processing module 7 is configured to convert the voice signal to be recognized into text information to be recognized, and process the text information to be recognized to generate a word vector to be recognized. The recognition module 8 is used for inputting the spectrogram to be recognized and the word vector to be recognized into the multi-modal emotion recognition model for classification so as to obtain an emotion recognition result. Wherein the multi-modal emotion recognition model is generated using the emotion recognition model generation system described in embodiment 3.
Using the multi-modal emotion recognition model generated in embodiment 3, the emotion recognition system provided in this embodiment enables the customer service robot to detect the guest's emotional state in real time and recognize negative emotions. Extreme emotions can be handled further, for example by issuing an early warning and switching to human service. This embodiment improves both the accuracy of negative-emotion recognition and the user experience, and makes the guest's impression of the customer service robot closer to interacting with a real person.
Embodiment 5
Fig. 9 is a schematic structural diagram of an electronic device according to embodiment 5 of the present invention. The electronic device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of generating the emotion recognition model of embodiment 1 or the method of emotion recognition of embodiment 2 when executing the program. The electronic device 30 shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 9, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as random access memory (RAM) 321 and/or cache memory 322, and may further include read-only memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing such as the generation method of the emotion recognition model provided in embodiment 1 of the present invention or the emotion recognition method provided in embodiment 2 by running the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., a keyboard, a pointing device, etc.). Such communication may occur through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 36. As shown, the network adapter 36 communicates with the other modules of the electronic device 30 via the bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module according to embodiments of the invention. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Embodiment 6
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the method for generating an emotion recognition model provided in embodiment 1 or the method for emotion recognition provided in embodiment 2.
More specific examples of the readable storage medium include, but are not limited to: a portable disk, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps of implementing the method for generating an emotion recognition model as described in embodiment 1 or the method for emotion recognition as described in embodiment 2, when said program product is run on said terminal device.
Where program code for carrying out the invention is written in any combination of one or more programming languages, the program code may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (14)

1. A method for generating an emotion recognition model, comprising:
acquiring a plurality of historical voice signals;
processing each historical voice signal to obtain a spectrogram;
converting each historical voice signal into text information, and processing the text information to generate word vectors;
and performing model training on the basis of a deep learning network by taking the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model.
2. A method of generating an emotion recognition model as recited in claim 1, wherein said deep learning network comprises a first multi-layer unidirectional LSTM network and a second multi-layer unidirectional LSTM network;
the step of performing model training based on a deep learning network by using the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model comprises the following steps:
using the plurality of spectrograms as input to the first multi-layer unidirectional LSTM network;
using the plurality of word vectors as input to the second multi-layer unidirectional LSTM network.
3. The method for generating an emotion recognition model as recited in claim 2, wherein the step of performing model training based on a deep learning network using the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model comprises:
screening and aggregating the hidden vectors output by the first multilayer unidirectional LSTM network by using an Attention mechanism to obtain a first context vector;
screening and aggregating the hidden vectors output by the second multilayer unidirectional LSTM network by using an Attention mechanism to obtain a second context vector;
and after voting is carried out on the first context vector and the second context vector, classifying by utilizing a SoftMax function.
4. A generation method of an emotion recognition model as recited in claim 3, wherein said step of subjecting the first context vector and the second context vector to voting processing includes:
adding the first context vector and the second context vector, or splicing the first context vector and the second context vector into one vector.
5. A method of generating an emotion recognition model as recited in claim 1, wherein said step of processing each of said historical speech signals to obtain a spectrogram comprises:
preprocessing each historical voice signal to obtain the frequency spectrum of each frame;
combining the frequency spectrums of the frames into the spectrogram along a time sequence;
the preprocessing comprises pre-emphasis, framing, windowing and fast Fourier transform;
and/or,
the step of processing the text information to generate a word vector comprises:
performing word segmentation on the text information;
mapping the words obtained after word segmentation into the word vectors by utilizing a trained word vector library;
the word vector library is obtained by utilizing a plurality of user comment texts and training the comment texts by using an unsupervised learning algorithm.
6. A method of emotion recognition, comprising:
processing the voice signal to be recognized to obtain a spectrogram of the voice to be recognized;
converting the voice signal to be recognized into text information to be recognized, and processing the text information to be recognized to generate a word vector to be recognized;
inputting the spectrogram to be recognized and the word vector to be recognized into a multi-modal emotion recognition model for classification to obtain an emotion recognition result;
the multi-modal emotion recognition model is generated using the method of generating an emotion recognition model as claimed in any one of claims 1 to 5.
7. A system for generating an emotion recognition model, comprising:
the acquisition module is used for acquiring a plurality of historical voice signals;
the first processing module is used for processing each historical voice signal to obtain a spectrogram;
the second processing module is used for converting each historical voice signal into text information;
the third processing module is used for processing the text information to generate a word vector;
and the model training module is used for performing model training on the basis of a deep learning network by taking the plurality of spectrograms and the plurality of word vectors as sample data to obtain a multi-modal emotion recognition model.
8. A generation system of an emotion recognition model as recited in claim 7, wherein said deep learning network includes a first multi-layer unidirectional LSTM network and a second multi-layer unidirectional LSTM network;
and the model training module takes the plurality of spectrograms as the input of the first multi-layer unidirectional LSTM network and the plurality of word vectors as the input of the second multi-layer unidirectional LSTM network.
9. The system for generating an emotion recognition model as recited in claim 8, wherein the model training module utilizes an Attention mechanism to filter and aggregate hidden vectors outputted from the first multi-layer unidirectional LSTM network to obtain a first context vector; the model training module also screens and aggregates hidden vectors output by the second multilayer unidirectional LSTM network by using an Attention mechanism to obtain a second context vector;
the model training module is further used for conducting voting processing on the first context vector and the second context vector and then conducting classification through a SoftMax function.
10. The system for generating an emotion recognition model as recited in claim 9,
the voting process on the first context vector and the second context vector comprises adding the first context vector to the second context vector;
or,
the voting process on the first context vector and the second context vector comprises splicing the first context vector and the second context vector into one vector.
11. The system for generating an emotion recognition model of claim 7, wherein the first processing module includes a preprocessing unit and a stitching unit;
the preprocessing unit is used for preprocessing each historical voice signal to obtain the frequency spectrum of each frame;
the splicing unit is used for splicing the frequency spectrums of the frames into the spectrogram along a time sequence;
the preprocessing comprises pre-emphasis, framing, windowing and fast Fourier transform;
and/or,
the second processing module comprises a word segmentation unit and a mapping unit;
the word segmentation unit is used for segmenting the text information;
the mapping unit is used for mapping the words obtained after word segmentation into the word vectors by utilizing a trained word vector library;
the word vector library is obtained by utilizing comment texts of a plurality of users and training the comment texts by using an unsupervised learning algorithm.
12. An emotion recognition system, comprising:
the fourth processing module is used for processing the voice signal to be recognized to obtain a spectrogram to be recognized;
the fifth processing module is used for converting the voice signal to be recognized into text information to be recognized and processing the text information to be recognized to generate a word vector to be recognized;
the recognition module is used for inputting the spectrogram to be recognized and the word vector to be recognized into a multi-modal emotion recognition model for classification so as to obtain an emotion recognition result;
the multi-modal emotion recognition model is generated using the system for generating an emotion recognition model as claimed in any one of claims 7 to 11.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of generating an emotion recognition model as claimed in any of claims 1 to 5 or the method of emotion recognition as claimed in claim 6 when the computer program is executed.
14. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the steps of the method of generating an emotion recognition model according to any one of claims 1 to 5 or the emotion recognition method according to claim 6.
CN201911172477.7A 2019-11-26 2019-11-26 Model generation method, emotion recognition method, system, device and storage medium Pending CN110909131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911172477.7A CN110909131A (en) 2019-11-26 2019-11-26 Model generation method, emotion recognition method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911172477.7A CN110909131A (en) 2019-11-26 2019-11-26 Model generation method, emotion recognition method, system, device and storage medium

Publications (1)

Publication Number Publication Date
CN110909131A true CN110909131A (en) 2020-03-24

Family

ID=69819533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911172477.7A Pending CN110909131A (en) 2019-11-26 2019-11-26 Model generation method, emotion recognition method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN110909131A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305641A (en) * 2017-06-30 2018-07-20 腾讯科技(深圳)有限公司 The determination method and apparatus of emotion information
US20190341025A1 (en) * 2018-04-18 2019-11-07 Sony Interactive Entertainment Inc. Integrated understanding of user characteristics by multimodal processing
CN108962255A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Emotion identification method, apparatus, server and the storage medium of voice conversation
CN109710934A (en) * 2018-12-26 2019-05-03 南京云问网络技术有限公司 Customer service quality surveillance algorithm based on emotion
CN110390956A (en) * 2019-08-15 2019-10-29 龙马智芯(珠海横琴)科技有限公司 Emotion recognition network model, method and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506757A (en) * 2020-04-10 2020-08-07 复旦大学 Voice marking device and method based on incremental iteration
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment
CN112837693A (en) * 2021-01-29 2021-05-25 上海钧正网络科技有限公司 User experience tendency identification method, device, equipment and readable storage medium
CN112530415A (en) * 2021-02-10 2021-03-19 北京百度网讯科技有限公司 Negative reply recognition model acquisition and negative reply recognition method and device
CN112530415B (en) * 2021-02-10 2021-07-16 北京百度网讯科技有限公司 Negative reply recognition model acquisition and negative reply recognition method and device
CN113707185A (en) * 2021-09-17 2021-11-26 卓尔智联(武汉)研究院有限公司 Emotion recognition method and device and electronic equipment

Similar Documents

Publication Publication Date Title
Badshah et al. Deep features-based speech emotion recognition for smart affective services
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN110853618B (en) Language identification method, model training method, device and equipment
CN110909131A (en) Model generation method, emotion recognition method, system, device and storage medium
US11087094B2 (en) System and method for generation of conversation graphs
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
US9672829B2 (en) Extracting and displaying key points of a video conference
CN108399923B Speaker recognition method and device for multi-speaker speech
Seng et al. Video analytics for customer emotion and satisfaction at contact centers
CN112949708B (en) Emotion recognition method, emotion recognition device, computer equipment and storage medium
Wang et al. Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition.
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN111326178A (en) Multi-mode speech emotion recognition system and method based on convolutional neural network
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
Li et al. Learning fine-grained cross modality excitement for speech emotion recognition
CN111651497A (en) User label mining method and device, storage medium and electronic equipment
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN112768070A (en) Mental health evaluation method and system based on dialogue communication
CN112735479B (en) Speech emotion recognition method and device, computer equipment and storage medium
CN106992000B (en) Prediction-based multi-feature fusion old people voice emotion recognition method
CN111508530B (en) Speech emotion recognition method, device and storage medium
Jia et al. A deep learning system for sentiment analysis of service calls
CN115512698B (en) Speech semantic analysis method
Kaur et al. Maximum likelihood based estimation with quasi oppositional chemical reaction optimization algorithm for speech signal enhancement
Shahbaz et al. Enhancing Contextualized GNNs for Multimodal Emotion Recognition: Improving Accuracy and Robustness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination