WO2021135457A1 - Recurrent neural network-based emotion recognition method, apparatus, and storage medium - Google Patents

Recurrent neural network-based emotion recognition method, apparatus, and storage medium

Info

Publication number
WO2021135457A1
WO2021135457A1 (PCT Application No. PCT/CN2020/118498)
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
speaker
feature
text
emotion recognition
Prior art date
Application number
PCT/CN2020/118498
Other languages
French (fr)
Chinese (zh)
Inventor
王彦
张加语
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021135457A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Definitions

  • This application relates to the field of artificial intelligence recognition, and in particular to a method, device, and storage medium for emotion recognition based on a recurrent neural network.
  • To address the neglect of context information, prior-art models such as DialogueRNN, KET, and DialogueGCN feed contextual sentence vectors into a recurrent neural network or Transformer to model the mutual influence between speakers.
  • DialogueRNN and DialogueGCN further use recurrent neural networks and graph convolutional networks, respectively, to capture each speaker's self-dependence, modeling the interaction among all sentences belonging to the same speaker.
  • However, the prior-art methods still have the following problems: on the one hand, they ignore speaker-switching information in the dialogue and cannot detect whether the speaker has changed, which impairs the model's understanding of the dependence between speakers and of each speaker's self-dependence, limiting the achievable emotion recognition accuracy; on the other hand, the models these methods use to capture a speaker's self-dependence are relatively complex, difficult to implement, and hurt computational efficiency.
  • The purpose of this application is to provide an emotion recognition method, device, and storage medium based on a recurrent neural network, so as to solve the technical problems of low emotion recognition accuracy and low computational efficiency in the prior art.
  • One technical solution of this application provides a method for emotion recognition based on a recurrent neural network, including:
  • obtaining the text feature of each sentence in the dialogue content;
  • encoding the text feature of each sentence to obtain the context feature of each sentence;
  • for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;
  • for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
  • for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.
  • Another technical solution of this application provides an emotion recognition device based on a recurrent neural network, the device including:
  • a sentence encoder, used to obtain the text feature of each sentence in the dialogue content;
  • a context encoder, used to encode the text feature of each sentence to obtain the context feature of each sentence;
  • a speaker encoder, used to update, for each sentence, the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;
  • a speaker conversion module, used to determine, for each sentence, the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
  • an emotion recognition module, used to obtain, for each sentence, the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.
  • Another technical solution of this application provides an emotion recognition device based on a recurrent neural network, the device including a processor and a memory coupled to the processor, the memory storing program instructions for implementing the above recurrent neural network-based emotion recognition method; when the processor executes the program instructions stored in the memory, the following steps are implemented:
  • obtaining the text feature of each sentence in the dialogue content;
  • encoding the text feature of each sentence to obtain the context feature of each sentence;
  • for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence;
  • for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
  • for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.
  • Another technical solution of this application provides a storage medium storing program instructions capable of implementing the above recurrent neural network-based emotion recognition method; when the program instructions are executed by a processor, the following steps are implemented:
  • obtaining the text feature of each sentence in the dialogue content;
  • encoding the text feature of each sentence to obtain the context feature of each sentence;
  • for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence;
  • for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
  • for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.
  • The recurrent neural network-based emotion recognition method, device, and storage medium of this application obtain the text feature of each sentence in the dialogue content and encode the sentences' text features to obtain each sentence's context feature; then, for each sentence, the speaker state feature of the sentence's speaker is updated based on the sentence's text feature; then, for each sentence, the speaker switching state is determined based on the speaker of the sentence and the speaker of the previous sentence; finally, for each sentence, the emotion recognition result is obtained based on the sentence's context feature, the speaker state feature of its speaker, and its speaker switching state. In this way, when the emotion label probabilities are computed, the switch embedding formed from the speaker switching state reinforces the sentence's context feature and speaker state feature, improving the accuracy of emotion recognition; at the same time, by perceiving speaker switching in the dialogue, the dependence between speakers and each speaker's self-dependence can be modeled more accurately, and modeling the speaker state features from the speaker's own sentences simplifies the calculation process, improving computational efficiency without affecting accuracy.
  • FIG. 1 is a flowchart of a method for emotion recognition based on a recurrent neural network according to a first embodiment of this application;
  • FIG. 2 is a schematic diagram of the model in the method for emotion recognition based on a recurrent neural network according to the first embodiment of this application;
  • FIG. 3 is a flowchart of a method for emotion recognition based on a recurrent neural network according to a second embodiment of this application;
  • FIG. 4 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a third embodiment of this application;
  • FIG. 5 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a fourth embodiment of this application;
  • FIG. 6 is a schematic structural diagram of a storage medium according to an embodiment of this application.
  • The terms “first”, “second”, and “third” in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to; thus, a feature qualified by “first”, “second”, or “third” may explicitly or implicitly include at least one such feature.
  • In this application, “a plurality of” means at least two, for example two or three, unless otherwise specifically defined. All directional indications in the embodiments of this application (such as up, down, left, right, front, back, etc.) are used only to explain the relative positional relationship, movement, and so on between components in a particular posture (as shown in the drawings); if that particular posture changes, the directional indication changes accordingly.
  • FIG. 1 is a schematic flowchart of a method for emotion recognition based on a recurrent neural network according to a first embodiment of the present application. It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the order of the flow shown in FIG. 1. As shown in FIG. 1 and FIG. 2, the recurrent neural network-based emotion recognition method includes the following steps:
  • S101: Acquire the text feature of each sentence in the dialogue content.
  • In step S101, a natural language tool, such as a word segmentation tool provided by a deep learning framework, is first used to segment each sentence in the dialogue content, obtaining each sentence's word sequence.
  • An appropriate vector conversion model is then selected, for example the GloVe model, which represents words as real-valued vectors, and each word in the sentence's word sequence is converted into a real-valued vector.
  • V is the dimension of the word vector; for example, V can be 300.
  • The resulting word sequence of sentence u_i is {x_1, x_2, …, x_t}, where t is the length of sentence u_i and x_j is the V-dimensional word vector corresponding to the j-th word of u_i.
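  • As a minimal illustration of this step, the following sketch maps one segmented sentence to its (t × V) matrix of GloVe vectors; the file format assumed here is the standard GloVe text file, and the helper names are hypothetical rather than from the patent:

```python
# Sketch only: convert a segmented sentence into a (t x V) GloVe matrix.
import numpy as np

def load_glove(path, dim=300):
    """Parse a GloVe text file: one word followed by `dim` floats per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_to_matrix(words, vectors, dim=300):
    """Map the word sequence {x_1, ..., x_t} to word vectors (zeros if unknown)."""
    return np.stack([vectors.get(w, np.zeros(dim, dtype=np.float32))
                     for w in words])
```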
  • The word sequence is then input into a convolutional neural network, which serves as the sentence encoder of this embodiment and extracts the sentence's text feature from its word vectors.
  • The convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer.
  • The convolutional layer uses several convolution filters of different sizes to extract n-gram features from the sentence's word sequence. Let U_i ∈ R^(t×V) denote the input sentence, where t is the sentence length, V is the word-vector dimension, and x_j is the V-dimensional word vector corresponding to the j-th word; W_a ∈ R^(K1×V) is a convolution filter, where K1 is the n-gram length, i.e. the length of the sliding window over the sentence, used to extract features at different positions of the sentence.
  • In an optional implementation, three kinds of convolution filters with heights 3, 4, and 5 are used, with 100 filters of each kind, each filter corresponding to one feature map.
  • The feature maps output by the convolutional layer are input to the pooling layer: a max-pooling operation (taking the maximum) first extracts the strongest feature of each feature map and minimizes the number of parameters; the result is then processed by a rectified linear unit (ReLU) activation function and output to the fully connected layer, which outputs the sentence's text feature, i.e. the sentence vector u_t.
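  • A minimal sketch of such a sentence encoder, assuming PyTorch; the dimensions follow the text (300-dimensional word vectors, filter heights 3/4/5 with 100 filters each), while the output dimension is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SentenceEncoder(nn.Module):
    """TextCNN sketch: n-gram convolutions, max-over-time pooling, ReLU, FC."""
    def __init__(self, embed_dim=300, num_filters=100, heights=(3, 4, 5),
                 out_dim=100):
        super().__init__()
        # one Conv1d per filter height K1, sliding over word positions
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in heights])
        self.fc = nn.Linear(num_filters * len(heights), out_dim)

    def forward(self, x):       # x: (batch, t, V), t >= max filter height
        x = x.transpose(1, 2)   # -> (batch, V, t) as Conv1d expects
        # max pooling keeps the strongest feature of each feature map;
        # ReLU and max-over-time commute, so this matches the text's order
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))   # sentence vector u_t
```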
  • S102: Encode the text feature of each sentence to obtain the context feature of each sentence.
  • In step S102, to model the influence of the dialogue context on the current sentence, i.e. the dependence between speakers, a long short-term memory network is used as the context encoder.
  • The encoder's input is the sentence vectors of all sentences in the dialogue, produced by the sentence encoder; its output is a sentence encoding fused with context information, i.e. the context feature c_t of the current sentence.
  • In an optional implementation, the text feature vector of each sentence is processed by the long short-term memory model to obtain its context feature vector.
  • First, from the sentence's text feature vector, the long short-term memory model produces the sentence's forward and backward long short-term memory feature vectors; then the forward and backward feature vectors are concatenated to give the sentence's context feature.
  • Specifically, for the first sentence, its text feature vector is obtained and input into the first long short-term memory network model to obtain the first output result; the text feature vector of the adjacent second sentence is obtained, and the first output result and the second sentence's text feature vector are input into the first long short-term memory network model to obtain the second output result; for each sentence, the previous round's output result and the sentence's text feature vector are input into the first long short-term memory network model to obtain the sentence's context feature vector; the above steps are repeated until the context feature of every sentence is obtained.
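  • A sketch of this context encoder under the same assumptions (PyTorch; the hidden size is illustrative), with the forward and backward LSTM outputs concatenated as described:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Sketch: bidirectional LSTM over a dialogue's sentence vectors; the
    forward/backward hidden states are concatenated into context feature c_t."""
    def __init__(self, in_dim=100, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, u):        # u: (batch, num_sentences, in_dim)
        c, _ = self.lstm(u)      # c: (batch, num_sentences, 2 * hidden)
        return c                 # c[:, t] is the context feature c_t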
  • This embodiment adopts a context encoder composed of the first long short-term memory network (LSTM) model. The LSTM consists of three gates: a forget gate, an input gate, and an output gate. The forget gate decides which information is allowed through a unit (also called a cell), the input gate decides how much new information is added to the cell, and the output gate decides what value to output.
  • Specifically, when the LSTM receives information from the previous time step at time t, the cell (the LSTM's neuron) first decides which information to forget; the forget gate controls this forgetting. The gate's inputs are the current input x_t and the previous output h_{t-1}, and the forget gate is computed as:
  • f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
  • where f_t is the forget gate's activation, indicating how much information the network forgets at time step t; σ is the sigmoid activation function, which constrains values to the range 0 to 1; W_f is the forget gate's input weight; and b_f is the forget gate's bias.
  • After discarding useless information, the cell decides which new input information to absorb. The input gate is computed as:
  • i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
  • where i_t is the input gate's activation, indicating how much information is input into the network at time step t; σ is the sigmoid activation function, which constrains values to the range 0 to 1; W_i is the input gate's input weight; and b_i is the input gate's bias.
  • The cell candidate at the current moment is:
  • C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c)
  • where C_t' is the cell candidate, W_c is the cell candidate's input weight, x_t is the current input, h_{t-1} is the previous output, b_c is the cell candidate's bias, and tanh is the hyperbolic tangent function, which constrains values to the range -1 to 1.
  • The cell state is then updated; the new cell state is computed from selective forgetting of the old cell state plus the candidate cell state:
  • C_t = f_t * C_{t-1} + i_t * C_t'
  • where C_t is the new cell state value at time step t, storing the network's current long-term memory; f_t is the forget gate's activation; C_{t-1} is the cell state at the previous time step, storing the long-term memory before time step t; i_t is the input gate's activation, indicating how much information is input into the network at time t; and C_t' is the current cell candidate (the update), indicating how much information the network updates at time t.
  • The output gate determines the output vector h_t of the hidden layer at the current time step:
  • o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
  • where o_t is the output gate's activation, indicating how much information the network outputs at time t; σ is the sigmoid activation function; W_o is the output gate's connection weight; b_o is the output gate's bias; x_t is the current input; and h_{t-1} is the network's output at time t-1, which stores the short-term memory before time t.
  • The output of the hidden layer at the current time step is the activated cell state, emitted through the output gate:
  • h_t = o_t * tanh(C_t)
  • where C_t is the updated cell state at the current time step; h_t is the output at time step t, which stores the network's current short-term memory; and tanh is the hyperbolic tangent function, which constrains values to the range -1 to 1.
  • W_f, W_i, W_c, W_o, b_f, b_i, b_c, and b_o are the parameters of the network, which are trained to improve its performance.
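  • The gate equations above can be checked with a small NumPy sketch; the weight shapes are illustrative, with W[g] mapping the concatenation [h_{t-1}, x_t] for gate g:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One time step implementing the gate equations above."""
    hx = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ hx + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ hx + b["i"])          # input gate
    C_cand = np.tanh(W["c"] @ hx + b["c"])       # cell candidate C_t'
    C_t = f_t * C_prev + i_t * C_cand            # new cell state
    o_t = sigmoid(W["o"] @ hx + b["o"])          # output gate
    h_t = o_t * np.tanh(C_t)                     # hidden output
    return h_t, C_t
```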
  • In step S103, to model each speaker's self-dependence within the dialogue, this embodiment uses another long short-term memory network as the speaker encoder and maintains a corresponding speaker state for each dialogue participant (speaker); the speaker state of each participant is updated only by the sentences uttered by that participant.
  • In some existing approaches, each speaker's historical sentences are modeled separately as memory units, and each speaker's memory is then merged with the representation of the current sentence through an attention mechanism to simulate the speaker state. In this embodiment, by contrast, for each sentence in the dialogue content the sentence itself is used to update the speaker state of that sentence's speaker.
  • Denoting a sentence and its sentence vector by the same symbol, the state feature of the current sentence's speaker is updated from that speaker's state feature at the previous time step and the text feature u_t of the current sentence.
  • The state feature s_{q,t} at time t is updated by the speaker-encoder LSTM from s_{q,t-1} and u_t, i.e. s_{q,t} = LSTM(s_{q,t-1}, u_t); s_{q,0} is initialized as a zero vector.
  • Compared with such attention-based designs, the speaker encoder of this embodiment is simpler to implement while the effect remains excellent. The speaker state features of the other speakers are not updated.
  • In an optional implementation, the speaker state can be generated in the following manner (see the sketch below): obtain the speaker's multiple sentences in the dialogue content and input the text features of these sentences into the second long short-term memory network model to extract the speaker's state feature. Specifically, for the speaker's first sentence, its text feature vector is obtained; the speaker's initialization feature and the first sentence's text feature vector are input into the second LSTM network model to obtain the first output feature; the text feature vector of the adjacent second sentence is obtained, and the first output feature and the second sentence's text feature vector are input into the second LSTM network model to obtain the second output feature; these steps are repeated up to the speaker's current sentence, where the previous round's output feature and the current sentence's text feature vector are input into the second LSTM network model to obtain the speaker's state feature s_t.
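  • A sketch of this speaker encoder (PyTorch; sharing a single LSTM cell across speakers is an implementation assumption), where each speaker's state is advanced only by that speaker's own sentences and s_{q,0} starts at zero:

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Sketch: per-speaker state s_q updated only by that speaker's sentences."""
    def __init__(self, in_dim=100, hidden=100):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.hidden = hidden

    def forward(self, sentence_vecs, speakers):
        # sentence_vecs: (num_sentences, in_dim); speakers: list of ids
        states, out = {}, []
        for u_t, q in zip(sentence_vecs, speakers):
            h_prev, c_prev = states.get(
                q, (u_t.new_zeros(self.hidden), u_t.new_zeros(self.hidden)))
            h, c = self.cell(u_t.unsqueeze(0),
                             (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))
            states[q] = (h.squeeze(0), c.squeeze(0))
            out.append(states[q][0])      # s_{q,t} for the current sentence
        return torch.stack(out)
```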
  • The speaker state feature includes the speaker's utterance emotion information; perceiving the speaker's emotional changes through the speaker state feature benefits emotion recognition of the speaker's sentences in the dialogue content.
  • In other embodiments, the speaker state feature includes not only the emotion information of the utterances but also the speaker's attribute information.
  • The attribute information includes one or more of age, gender, hobbies, speaking style, place of origin, and education level.
  • S104: For each sentence, determine the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence.
  • In step S104, to model the dependence between speakers (step S102) and each speaker's self-dependence (step S103) more accurately, the model must be able to perceive speaker switching.
  • For this purpose, this embodiment introduces the concept of a speaker switching state.
  • The speaker switching state depends on the speaker q(u_t) at time t (the t-th sentence) and the speaker at time t-1 (the (t-1)-th sentence).
  • An embedding layer G is used to embed the speaker switching state into a 100-dimensional space, and the parameters of the embedding layer G are updated during model training.
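  • A sketch of the switching state and embedding layer G; how the dialogue's first sentence, which has no previous speaker, is handled is an assumption here:

```python
import torch
import torch.nn as nn

# trainable embedding layer G: 2 switching states -> 100-dimensional space
G = nn.Embedding(2, 100)

def switch_states(speakers):
    """1 if the speaker changed relative to the previous sentence, else 0.
    The first sentence is treated as a switch (an assumption, not from the
    patent text)."""
    s = [1] + [int(speakers[t] != speakers[t - 1])
               for t in range(1, len(speakers))]
    return torch.tensor(s)

switch_emb = G(switch_states(["A", "B", "B", "A"]))   # shape: (4, 100)
```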
  • S105: For each sentence, obtain the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.
  • The emotion label categories include happiness, sadness, neutrality, excitement, anger, and frustration.
  • The context feature c_t of the current sentence, the state feature s_t of the current sentence's speaker, and the speaker switch embedding are concatenated into a new vector; the new vector is input into a fully connected layer, and a normalized exponential (softmax) function outputs the probability of each emotion label category for the current sentence, finally yielding the probability distribution over emotion label categories for each sentence in the dialogue content.
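  • A sketch of this classification head; the input dimensions are illustrative and follow the earlier sketches:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionClassifier(nn.Module):
    """Sketch: concatenate c_t, s_t, and the switch embedding, then a fully
    connected layer with softmax over the six emotion labels."""
    def __init__(self, c_dim=200, s_dim=100, sw_dim=100, num_labels=6):
        super().__init__()
        self.fc = nn.Linear(c_dim + s_dim + sw_dim, num_labels)

    def forward(self, c_t, s_t, sw_t):
        logits = self.fc(torch.cat([c_t, s_t, sw_t], dim=-1))
        return F.softmax(logits, dim=-1)   # per-label probabilities
```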
  • FIG. 3 is a schematic flowchart of a method for emotion recognition based on a recurrent neural network according to the second embodiment of the present application. It should be noted that, provided substantially the same results are obtained, the method of the present application is not limited to the order of the flow shown in FIG. 3. As shown in FIG. 3, the recurrent neural network-based emotion recognition method includes the following steps:
  • In step S205, corresponding summary information is obtained based on the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switching feature of the sentence. Specifically, the summary information is obtained by hashing these features, for example with the SHA-256 algorithm.
  • Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users.
  • The user equipment can download the summary information from the blockchain to verify whether the text feature, the context feature, the speaker state feature, or the speaker switching feature of the sentence has been tampered with.
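  • A sketch of the digest computation; the feature serialization format is an assumption, and only the use of SHA-256 comes from the text:

```python
import hashlib
import json

def summary_digest(text_feat, context_feat, speaker_feat, switch_feat):
    """SHA-256 digest of the four per-sentence features; the hex digest is the
    summary information uploaded to the blockchain for later tamper checks."""
    payload = json.dumps(
        {"text": text_feat, "context": context_feat,
         "speaker": speaker_feat, "switch": switch_feat},
        sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# e.g. summary_digest([0.1, 0.2], [0.3], [0.4], [1]) -> 64-char hex string
```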
  • The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • A blockchain can comprise a blockchain underlying platform, a platform product service layer, and an application service layer.
  • FIG. 4 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a third embodiment of the present application.
  • The device 30 includes a sentence encoder 31, a context encoder 32, a speaker encoder 33, a speaker conversion module 34, and an emotion recognition module 35.
  • The sentence encoder 31 is used to obtain the text feature of each sentence in the dialogue content.
  • Specifically, the sentence encoder 31 is used to segment each sentence in the dialogue content with natural language tools to obtain each sentence's word sequence; to convert each word in the sentence's word sequence into a corresponding word vector using the GloVe model; and to input the sentence's word sequence into a convolutional neural network to obtain each sentence's sentence vector, which serves as the text feature.
  • The context encoder 32 is configured to process the text feature of each sentence with the first long short-term memory model to obtain the context feature of each sentence.
  • Specifically, the context encoder 32 is used to obtain the text feature of the first sentence in the dialogue content and input it into the first long short-term memory network model to obtain the first output result; to obtain the text feature of the adjacent second sentence and input the first output result together with the second sentence's text feature into the first LSTM network model to obtain the second output result; and, for each sentence up to the current one, to input the previous round's output result and the current sentence's text feature into the first LSTM network model to obtain the current sentence's context feature, repeating these steps until the context feature of every sentence is obtained.
  • The speaker encoder 33 is further used to obtain a speaker's multiple sentences in the dialogue content, input the text features of these sentences into the second long short-term memory network model, and extract the speaker's state feature. Furthermore, the speaker encoder 33 is configured to obtain the text feature vector of the speaker's first sentence and input the speaker's initialization feature together with it into the second LSTM network model to obtain the first output feature; to obtain the text feature vector of the adjacent second sentence and input the first output feature together with it into the second LSTM network model to obtain the second output feature; and to repeat these steps until the speaker's state feature is obtained.
  • The speaker conversion module 34 is used to take a first value as the speaker switching state of the current sentence when the speaker of the sentence differs from the speaker of the previous sentence, and a second value as the speaker switching state of the sentence when the speakers are the same.
  • For example, the first value may be 1 and the second value may be 0.
  • FIG. 5 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a fourth embodiment of the present application.
  • The emotion recognition device 40 based on the recurrent neural network includes a processor 41 and a memory 42 coupled to the processor 41.
  • the memory 42 stores program instructions for realizing the emotion recognition based on the recurrent neural network in any of the above embodiments.
  • the processor 41 is configured to execute program instructions stored in the memory 42 to perform emotion recognition based on the recurrent neural network.
  • The processor 41 may also be referred to as a CPU (Central Processing Unit).
  • the processor 41 may be an integrated circuit chip with signal processing capabilities.
  • The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • FIG. 6 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • The storage medium of the embodiment of the present application stores program instructions 51 capable of implementing all of the above methods; the storage medium may be non-volatile or volatile.
  • The program instructions 51 may be stored in the above storage medium in the form of a software product, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods described in the embodiments of this application.
  • The aforementioned storage devices include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks, as well as terminal devices such as computers, servers, mobile phones, and tablets.
  • In the embodiments provided in this application, the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a logical functional division; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • The functional units in the various embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit can be implemented in the form of hardware or of a software functional unit. The above are only implementations of this application and do not limit the scope of this application; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a recurrent neural network-based emotion recognition method, apparatus, and storage medium. The method comprises: obtaining the text feature of each sentence in the content of a conversation (S101); encoding the text feature of each sentence to obtain its context feature (S102); for each sentence, updating the speaker state feature of its speaker on the basis of the sentence's text feature, the pre-update speaker state feature having been obtained from the text features of all of that speaker's previous sentences (S103); for each sentence, determining a speaker switching state on the basis of the speaker of the sentence and the speaker of the previous sentence (S104); and for each sentence, obtaining an emotion recognition result on the basis of the sentence's context feature, the speaker state feature of its speaker, and its speaker switching state (S105). By the described means, the accuracy of emotion recognition is increased, the dependency relationships between speakers and of each speaker on itself are modeled more accurately, and the calculation process is simplified, increasing calculation efficiency without affecting the accuracy of emotion recognition.

Description

Emotion recognition method, device and storage medium based on recurrent neural network

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on August 6, 2020, with application number 202010785309.1 and the invention title "Recurrent Neural Network-based Emotion Recognition Method, Apparatus, and Storage Medium", the entire content of which is incorporated into this application by reference.

[Technical Field]

This application relates to the field of artificial intelligence recognition, and in particular to a method, device, and storage medium for emotion recognition based on a recurrent neural network.

[Background]

In the field of natural language processing, dialogue emotion recognition technology has received increasing attention. With the popularity of social media platforms and conversational agents, dialogue corpora keep growing, making it feasible to mine a speaker's overall dynamic emotional changes from conversations. Dialogue emotion recognition is widely applied in scenarios such as opinion mining, healthcare, and call centers; it can not only mine speakers' emotional opinions but also help build intelligent dialogue robot systems with emotion.

Early research on dialogue emotion recognition was mainly based on call-center dialogue corpora, using dictionary-based methods and audio features to recognize emotions. In recent years, research on dialogue emotion recognition has mainly been based on deep learning algorithms such as convolutional neural networks, recurrent neural networks, graph convolutional networks, and Transformers, training emotion recognition models on plain-text corpora or on multimodal data including text, audio, and video. Among these methods, dictionary-based approaches recognize emotions only from single sentences in the dialogue. Some deep learning methods use convolutional neural networks or other models as sentence encoders to generate a sentence vector for the current sentence, feed it directly into a fully connected network, and finally use a softmax function to output the probability distribution of emotion labels. However, these methods ignore dialogue context information and cannot model the dependencies between sentences and speakers from a global perspective, which limits the improvement of emotion classification accuracy.

To solve the problem of ignoring context information, prior-art models such as DialogueRNN, KET, and DialogueGCN feed contextual sentence vectors into a recurrent neural network or Transformer to model the mutual influence between speakers. Moreover, DialogueRNN and DialogueGCN use recurrent neural networks and graph convolutional networks, respectively, to capture each speaker's self-dependence, modeling the interaction among all sentences belonging to the same speaker. However, the inventor realized that the prior-art methods still have the following problems: on the one hand, they ignore speaker-switching information in the dialogue and cannot detect whether the speaker has changed, which impairs the model's understanding of the dependence between speakers and of each speaker's self-dependence, limiting the achievable emotion recognition accuracy; on the other hand, the models these methods use to capture a speaker's self-dependence are relatively complex, difficult to implement, and hurt computational efficiency.

Therefore, it is necessary to provide a new emotion recognition method, device, and storage medium based on a recurrent neural network.
[Summary of the Invention]

The purpose of this application is to provide an emotion recognition method, device, and storage medium based on a recurrent neural network, so as to solve the technical problems of low emotion recognition accuracy and low computational efficiency in the prior art.

One technical solution of this application provides a method for emotion recognition based on a recurrent neural network, including:

obtaining the text feature of each sentence in the dialogue content;

encoding the text feature of each sentence to obtain the context feature of each sentence;

for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;

for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and

for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.

Another technical solution of this application provides an emotion recognition device based on a recurrent neural network, the device including:

a sentence encoder, used to obtain the text feature of each sentence in the dialogue content;

a context encoder, used to encode the text feature of each sentence to obtain the context feature of each sentence;

a speaker encoder, used to update, for each sentence, the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;

a speaker conversion module, used to determine, for each sentence, the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and

an emotion recognition module, used to obtain, for each sentence, the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.

Another technical solution of this application provides an emotion recognition device based on a recurrent neural network, the device including a processor and a memory coupled to the processor, the memory storing program instructions for implementing the above recurrent neural network-based emotion recognition method; when the processor executes the program instructions stored in the memory, the following steps are implemented:

obtaining the text feature of each sentence in the dialogue content;

encoding the text feature of each sentence to obtain the context feature of each sentence;

for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;

for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and

for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.

Another technical solution of this application provides a storage medium storing program instructions capable of implementing the above recurrent neural network-based emotion recognition method; when the program instructions are executed by a processor, the following steps are implemented:

obtaining the text feature of each sentence in the dialogue content;

encoding the text feature of each sentence to obtain the context feature of each sentence;

for each sentence, updating the speaker state feature of the sentence's speaker based on the text feature of the sentence, where the pre-update speaker state feature is obtained from the text features of all of that speaker's previous sentences;

for each sentence, determining the speaker switching state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and

for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the sentence's speaker, and the speaker switching state of the sentence.

The recurrent neural network-based emotion recognition method, device, and storage medium of this application obtain the text feature of each sentence in the dialogue content and encode the sentences' text features to obtain each sentence's context feature; then, for each sentence, the speaker state feature of the sentence's speaker is updated based on the sentence's text feature; then, for each sentence, the speaker switching state is determined based on the speaker of the sentence and the speaker of the previous sentence; finally, for each sentence, the emotion recognition result is obtained based on the sentence's context feature, the speaker state feature of its speaker, and its speaker switching state. In this way, when the emotion label probabilities are computed, the switch embedding formed from the speaker switching state reinforces the sentence's context feature and speaker state feature, improving emotion recognition accuracy; at the same time, by perceiving speaker switching in the dialogue, the dependence between speakers and each speaker's self-dependence can be modeled more accurately; and modeling the speaker state features from the speaker's own sentences simplifies the calculation process, improving computational efficiency without affecting emotion recognition accuracy.
[Brief Description of the Drawings]

FIG. 1 is a flowchart of a method for emotion recognition based on a recurrent neural network according to a first embodiment of this application;

FIG. 2 is a schematic diagram of the model in the method for emotion recognition based on a recurrent neural network according to the first embodiment of this application;

FIG. 3 is a flowchart of a method for emotion recognition based on a recurrent neural network according to a second embodiment of this application;

FIG. 4 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a third embodiment of this application;

FIG. 5 is a schematic structural diagram of an emotion recognition device based on a recurrent neural network according to a fourth embodiment of this application;

FIG. 6 is a schematic structural diagram of a storage medium according to an embodiment of this application.

[Detailed Description]

The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the drawings in the embodiments of this application. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.

The terms "first", "second", and "third" in this application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to; thus, a feature qualified by "first", "second", or "third" may explicitly or implicitly include at least one such feature. In the description of this application, "a plurality of" means at least two, for example two or three, unless otherwise specifically defined. All directional indications in the embodiments of this application (such as up, down, left, right, front, back, etc.) are used only to explain the relative positional relationship, movement, and so on between components in a particular posture (as shown in the drawings); if that particular posture changes, the directional indication changes accordingly. In addition, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion: a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or device.

Reference to an "embodiment" herein means that a particular feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.

FIG. 1 is a schematic flowchart of a method for emotion recognition based on a recurrent neural network according to a first embodiment of the present application. It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the order of the flow shown in FIG. 1. As shown in FIG. 1 and FIG. 2, the recurrent neural network-based emotion recognition method includes the following steps:
S101: Acquire the text feature of each sentence in the dialogue content.

In step S101, a natural language tool, such as a word segmentation tool provided by a deep learning framework, is first used to segment each sentence in the dialogue content, obtaining each sentence's word sequence. An appropriate vector conversion model is then selected, for example the GloVe model, which represents words as real-valued vectors, and each word in the sentence's word sequence is converted into a real-valued vector; V is the dimension of the word vector, and V can be, for example, 300. Specifically, the resulting word sequence of sentence u_i is {x_1, x_2, …, x_t}, where t is the length of sentence u_i and x_j is the V-dimensional word vector corresponding to the j-th word of u_i.

The word sequence of sentence u_i is then input into a convolutional neural network, which serves as the sentence encoder of this embodiment and extracts the sentence's text feature from its word vectors. The convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer. The convolutional layer uses several convolution filters of different sizes to extract n-gram features from the sentence's word sequence. Let U_i ∈ R^(t×V) denote the input sentence, where t is the sentence length, V is the word-vector dimension, and x_j is the V-dimensional word vector corresponding to the j-th word; W_a ∈ R^(K1×V) is a convolution filter, where K1 is the n-gram length, i.e. the length of the sliding window over the sentence, used to extract features at different positions of the sentence. In an optional implementation, three kinds of convolution filters with heights 3, 4, and 5 are used, with 100 filters of each kind, each filter corresponding to one feature map. The feature maps output by the convolutional layer are input to the pooling layer: a max-pooling operation (taking the maximum) first extracts the strongest feature of each feature map and minimizes the number of parameters; the result is then processed by a rectified linear unit (ReLU) activation function and output to the fully connected layer, which outputs the sentence's text feature, i.e. the sentence vector u_t.

S102: Encode the text feature of each sentence to obtain the context feature of each sentence.

In step S102, to model the influence of the dialogue context on the current sentence, i.e. the dependence between speakers, a long short-term memory network is used as the context encoder. The encoder's input is the sentence vectors of all sentences in the dialogue, produced by the sentence encoder; its output is a sentence encoding fused with context information, i.e. the context feature c_t of the current sentence. Experiments show that a context encoder composed of long short-term memory networks can effectively capture context information, modeling the dependence between the current sentence and the other sentences in the dialogue. That is, in step S102, the sentence vectors of the dialogue content are context-encoded, and the resulting context feature vectors are text representation vectors containing the semantic relationships between sentences.

In an optional implementation, the text feature vector of each sentence is processed by a long short-term memory model to obtain its context feature vector. First, from the sentence's text feature vector, the long short-term memory model produces the sentence's forward and backward long short-term memory feature vectors; then the forward and backward feature vectors are concatenated to give the sentence's context feature.

Specifically, for the first sentence of the dialogue content, its text feature vector is obtained and input into the first long short-term memory network model to obtain the first output result; the text feature vector of the adjacent second sentence is obtained, and the first output result and the second sentence's text feature vector are input into the first long short-term memory network model to obtain the second output result; for each sentence, the previous round's output result and the sentence's text feature vector are input into the first long short-term memory network model to obtain the sentence's context feature vector; the above steps are repeated until the context feature of every sentence is obtained.
This embodiment uses a context encoder built from the first long short-term memory (LSTM) network model. An LSTM consists of three gates, namely a forget gate, an input gate, and an output gate: the forget gate decides which information may pass through a cell, the input gate decides how much new information is added to the cell, and the output gate decides what value to output. Specifically, when the LSTM receives information from the previous time step at time t, the cell (an LSTM neuron) first decides which part of the information to forget; the forget gate controls this forgetting. The inputs of this gate are the input x_t at the current time step and the output h_{t-1} of the previous time step. The forget gate is given by:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
where f_t is the forget gate activation, indicating how much information the network forgets at time step t; σ is the sigmoid activation function, which constrains values to the range 0 to 1; W_f is the input weight of the forget gate, and b_f is the bias of the forget gate.
After discarding useless information, the cell must decide which newly input information to absorb. The input gate is given by:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
where i_t is the input gate activation, indicating how much information is fed into the network at time step t; σ is the sigmoid activation function, constraining values to the range 0 to 1; W_i is the input weight of the input gate, and b_i is the bias of the input gate.
The candidate cell state at the current time step is:
C_t' = tanh(W_c · [h_{t-1}, x_t] + b_c), where C_t' is the candidate cell state, W_c is the input weight of the candidate cell state, x_t is the input at the current time step, h_{t-1} is the output of the previous time step, and b_c is the bias of the candidate cell state; tanh is the hyperbolic tangent function, which constrains values to the range -1 to 1.
The cell state is then updated: the new cell state is computed from the selectively forgotten old cell state and the candidate cell state:
C_t = f_t * C_{t-1} + i_t * C_t', where C_t is the new cell state, i.e. the output of the network at time step t, which stores the network's long-term memory; f_t is the forget gate activation; C_{t-1} is the cell state of the previous time step, i.e. the output of the network at time step t-1, which stores the long-term memory before time step t; i_t is the input gate activation, indicating how much information is fed into the network at time step t; and C_t' is the candidate cell state at the current time step, indicating how much information the network updates at time step t.
Finally, the output gate determines the output vector h_t of the hidden layer at the current time step. The output gate is defined as:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
where o_t is the output gate activation; σ is the sigmoid activation function; W_o is the connection weight of the output gate; b_o is the bias of the output gate; x_t is the input at the current time step, i.e. the network input at time step t; and h_{t-1} is the output of the previous time step, i.e. the network output at time step t-1, which stores the short-term memory before time step t.
The output of the hidden layer at the current time step is the activated cell state, emitted through the output gate:
h_t = o_t * tanh(C_t)
where o_t is the output gate activation, indicating how much information the network outputs at time step t; C_t is the updated cell state at the current time step; h_t is the output at the current time step, i.e. the network output at time step t, which stores the network's short-term memory; and tanh is the hyperbolic tangent function, constraining values to the range -1 to 1.
Here W_f, W_i, W_c, W_o, b_f, b_i, b_c, and b_o are the parameters of the network; the network is trained on these parameters to improve its performance.
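Written as code, the gate equations above map one-to-one onto the following step function (a sketch for illustration; torch is used only for the tensor arithmetic, and each weight matrix is assumed to act on the concatenation [h_{t-1}, x_t] as in the formulas):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step written directly from the gate equations above.

    Each W has shape (hidden, hidden + input) and maps [h_{t-1}, x_t]
    to the hidden dimension; each b has shape (hidden,).
    """
    hx = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(hx @ W_f.T + b_f)      # forget gate
    i_t = torch.sigmoid(hx @ W_i.T + b_i)      # input gate
    c_cand = torch.tanh(hx @ W_c.T + b_c)      # candidate cell state C_t'
    c_t = f_t * c_prev + i_t * c_cand          # new cell state (long-term memory)
    o_t = torch.sigmoid(hx @ W_o.T + b_o)      # output gate
    h_t = o_t * torch.tanh(c_t)                # hidden-layer output (short-term memory)
    return h_t, c_t
```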
S103: For each sentence, update the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of that speaker's previous sentences.
In step S103, in order to model the speakers' self-dependencies within the dialogue, this embodiment uses another long short-term memory network as a speaker encoder and maintains a speaker state for each participant (speaker) in the dialogue; the speaker state of each participant is updated only by the sentences that this participant utters. In this embodiment, each speaker's historical sentences are modeled separately as a memory unit, and each speaker's memory is then fused with the representation of the current sentence through an attention mechanism, thereby simulating the speaker's state. For each sentence in the dialogue content, the speaker state of the speaker of that sentence is updated using the sentence. Specifically, for the current sentence u_t (for simplicity of description, the sentence and its sentence vector are denoted by the same symbol), the state feature of the current speaker is updated from that speaker's state feature at the previous time step and the text feature u_t of the current sentence. Let the speaker of sentence u_t be q = q(u_t); then the state feature s_{q,t} of speaker q at time t is updated by the following formula:
s_{q,t} = LSTM(s_{q,t-1}, u_t)
where s_{q,0} is initialized as the zero vector. Unlike the relatively complex speaker encoders in DialogueRNN and DialogueGCN, which must take into account the sentences spoken by others, the speaker encoder of this embodiment is simpler to implement and performs equally well. The speaker state features of the other speakers are not updated.
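A speaker encoder of this kind can be sketched as follows (a PyTorch sketch; the per-speaker dictionary of LSTM states and the class name are illustrative assumptions, and the attention-based fusion mentioned above is omitted for brevity):

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Keeps one LSTM state per speaker; a speaker's state is updated only by
    the sentences that this speaker utters, with s_{q,0} the zero vector."""
    def __init__(self, in_dim=100, hidden=100):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.hidden = hidden

    def forward(self, sent_vecs, speaker_ids):
        # sent_vecs: (num_sentences, in_dim); speaker_ids: one label per sentence
        states, out = {}, []
        for u_t, q in zip(sent_vecs, speaker_ids):
            h, c = states.get(q, (torch.zeros(1, self.hidden),
                                  torch.zeros(1, self.hidden)))
            h, c = self.cell(u_t.unsqueeze(0), (h, c))  # update speaker q only
            states[q] = (h, c)                          # other speakers unchanged
            out.append(h.squeeze(0))                    # s_{q(u_t), t}
        return torch.stack(out)
```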
For a given speaker, the speaker state can be generated as follows: obtain the speaker's multiple sentences in the dialogue content, input the text features of these sentences into the second long short-term memory network model, and extract the speaker's state feature. Specifically, for the speaker's first sentence, its text feature vector is obtained; the speaker's initialization feature and the text feature vector of the first sentence are input into the second LSTM network model to obtain a first output feature; the text feature vector of the second sentence, adjacent to the first, is obtained, and the first output feature together with the text feature vector of the second sentence is input into the second LSTM network model to obtain a second output feature; these steps are repeated until the speaker's current sentence, at which point the output feature of the previous round and the text feature vector of the current sentence are input into the second LSTM network model to obtain the speaker's state feature s_t.
In an optional embodiment, the speaker's state feature contains the emotional information of the speaker's utterances; sensing the speaker's emotional changes through the state feature benefits the emotion recognition of that speaker's sentences in the dialogue content.
In another optional embodiment, in addition to utterance emotion information, the speaker's state feature may also include attribute information of the speaker; for example, the attribute information includes one or more of age, gender, hobbies, speaking style, place of origin, and educational level.
S104: For each sentence, determine the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence.
In step S104, in order to model the dependencies between speakers (step S102) and the speakers' self-dependencies (step S103) more accurately, the model must be able to perceive speaker switches. For this purpose, this embodiment introduces the concept of a speaker switch state. For the current sentence u_t in the dialogue content, i.e. the t-th sentence, the speaker switch state depends on the speaker q(u_t) at time t (the t-th sentence) and the speaker q(u_{t-1}) at time t-1 (the (t-1)-th sentence): if the two speakers are the same, the speaker switch state at time t is a first value; otherwise it is a second value. Specifically, the first value may be 1 and the second value 0. The following formula describes the computation of the speaker switch state b_t at time t:
b_t = 1 if q(u_t) = q(u_{t-1}); b_t = 0 if q(u_t) ≠ q(u_{t-1})
In this embodiment, an embedding layer G embeds the speaker switch state into a 100-dimensional space; the parameters of the embedding layer G are updated during model training.
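The switch state and its embedding can be sketched as follows (illustrative PyTorch; treating the first sentence of a dialogue, which has no predecessor, as b_1 = 0 is an assumption made here, since the embodiment does not specify that case):

```python
import torch
import torch.nn as nn

# Embedding layer G: maps the binary switch state into a 100-dimensional
# space; its parameters are learned during model training, as described above.
G = nn.Embedding(num_embeddings=2, embedding_dim=100)

def switch_embeddings(speaker_ids):
    """b_t = 1 if the speaker is unchanged from the previous sentence, else 0.
    The first sentence has no predecessor; b_1 = 0 is an assumption here."""
    b = [0] + [int(cur == prev) for prev, cur in zip(speaker_ids, speaker_ids[1:])]
    return G(torch.tensor(b))            # (num_sentences, 100)
```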
S105: For each sentence, obtain the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
In step S105, the emotion label categories are happy, sad, neutral, excited, angry, and frustrated. For the current sentence, the context feature c_t of the current sentence, the state feature s_t of its speaker, and its speaker switch state are concatenated to form a new vector, which is input into a fully connected layer; a normalized exponential (softmax) function then outputs the probability of each emotion label category for the current sentence, and finally the probability distribution over emotion label categories is output for every sentence in the dialogue content.
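The classification head can be sketched as follows (illustrative PyTorch; the input dimensions assume the 2×100-dimensional bidirectional context feature, the 100-dimensional speaker state, and the 100-dimensional switch embedding from the sketches above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMOTIONS = ["happy", "sad", "neutral", "excited", "angry", "frustrated"]

class EmotionClassifier(nn.Module):
    """Concatenates context feature c_t, speaker state s_t and the embedded
    switch state, then applies a fully connected layer followed by softmax."""
    def __init__(self, ctx_dim=200, spk_dim=100, switch_dim=100):
        super().__init__()
        self.fc = nn.Linear(ctx_dim + spk_dim + switch_dim, len(EMOTIONS))

    def forward(self, c_t, s_t, switch_emb):
        logits = self.fc(torch.cat([c_t, s_t, switch_emb], dim=-1))
        return F.softmax(logits, dim=-1)  # per-class probabilities per sentence
```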
FIG. 3 is a schematic flowchart of a recurrent neural network-based emotion recognition method according to the second embodiment of the present application. It should be noted that, provided substantially the same results are obtained, the method of the present application is not limited to the sequence shown in FIG. 3. As shown in FIG. 3, the recurrent neural network-based emotion recognition method includes the following steps:
S201: Acquire the text feature of each sentence in the dialogue content.
S202: Encode the text feature of each sentence to obtain the context feature of each sentence.
S203: For each sentence, update the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of that speaker's previous sentences.
S204: For each sentence, determine the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence.
S205: Upload the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence to a blockchain, so that the blockchain stores the text feature, context feature, speaker state feature, and speaker switch feature of the sentence in encrypted form.
S206: For each sentence, obtain the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
In step S205, corresponding summary information is obtained from the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence. Specifically, the summary information is obtained by hashing these features, for example with the SHA-256 algorithm. Uploading the summary information to the blockchain ensures its security and its fairness and transparency to users. A user device can download the summary information from the blockchain to verify whether the text feature, context feature, speaker state feature, or speaker switch feature of the sentence has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, each data block containing a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, and an application service layer.
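The hashing step can be sketched as follows (illustrative Python; the embodiment specifies only that a SHA-256-style hash of the four features is computed, so the serialization via pickle is an assumption made here):

```python
import hashlib
import pickle

def feature_digest(text_feat, ctx_feat, speaker_feat, switch_feat):
    """Serialize the four feature tensors and hash them with SHA-256 to
    obtain the summary information to be uploaded to the blockchain.
    (The pickle serialization format is an assumption for illustration.)"""
    payload = pickle.dumps((text_feat, ctx_feat, speaker_feat, switch_feat))
    return hashlib.sha256(payload).hexdigest()
```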
For details of the other steps, refer to the description of the first embodiment; they are not repeated here.
FIG. 4 is a schematic structural diagram of a recurrent neural network-based emotion recognition apparatus according to the third embodiment of the present application. As shown in FIG. 4, the apparatus 30 includes a sentence encoder 31, a context encoder 32, a speaker encoder 33, a speaker conversion module 34, and an emotion recognition module 35. The sentence encoder 31 is configured to acquire the text feature of each sentence in the dialogue content; the context encoder 32 is configured to encode the text feature of each sentence to obtain the context feature of each sentence; the speaker encoder 33 is configured, for each sentence, to update the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of that speaker's previous sentences; the speaker conversion module 34 is configured, for each sentence, to determine the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and the emotion recognition module 35 is configured, for each sentence, to obtain the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
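Assuming the earlier sketches are instantiated, the five modules of apparatus 30 could be wired together roughly as follows (a hypothetical usage sketch; recognize_emotions and glove_vectors are illustrative names, not part of the embodiment, and glove_vectors is assumed to be a (num_sentences, t, 300) tensor of looked-up GloVe word vectors):

```python
# Hypothetical end-to-end wiring, reusing the earlier sketches
# (SentenceEncoder, ContextEncoder, SpeakerEncoder, switch_embeddings,
# EmotionClassifier) defined above in this document.
sentence_encoder = SentenceEncoder()
context_encoder = ContextEncoder()
speaker_encoder = SpeakerEncoder()
classifier = EmotionClassifier()

def recognize_emotions(glove_vectors, speaker_ids):
    u = sentence_encoder(glove_vectors)               # text features      (module 31)
    c = context_encoder(u.unsqueeze(0)).squeeze(0)    # context features   (module 32)
    s = speaker_encoder(u, speaker_ids)               # speaker states     (module 33)
    g = switch_embeddings(speaker_ids)                # switch embeddings  (module 34)
    return classifier(c, s, g)                        # emotion probs      (module 35)
```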
Further, the sentence encoder 31 is configured to segment each sentence in the dialogue content into words using a natural language tool to obtain the word sequence of each sentence; to convert each word in the word sequence of the sentence into a corresponding word vector using the GloVe model; and to input the word sequence of the sentence into a convolutional neural network to obtain the sentence vector of each sentence, the sentence vector being used as the text feature. Further, the context encoder 32 is configured to process the text feature of each sentence through the first long short-term memory model to obtain the context feature of each sentence. Still further, the context encoder 32 is configured to acquire the text feature of the first sentence in the dialogue content and input it into the first LSTM network model to obtain a first output result; to acquire the text feature of the second sentence, adjacent to the first, and input the first output result together with the text feature of the second sentence into the first LSTM network model to obtain a second output result; and, for each subsequent sentence, to input the output result of the previous round together with the text feature of the current sentence into the first LSTM network model to obtain the context feature of the current sentence, repeating these steps until the context feature of every sentence has been obtained.
Further, the speaker encoder 33 is also configured to acquire the speaker's multiple sentences in the dialogue content, input the text features of these sentences into the second long short-term memory network model, and extract the speaker's state feature. Still further, the speaker encoder 33 is configured to acquire the text feature vector of the speaker's first sentence and input the speaker's initialization feature together with the text feature vector of the first sentence into the second LSTM network model to obtain a first output feature; to acquire the text feature vector of the second sentence, adjacent to the first, and input the first output feature together with the text feature vector of the second sentence into the second LSTM network model to obtain a second output feature; and to repeat these steps up to the speaker's current sentence, inputting the output feature of the previous round together with the text feature vector of the current sentence into the second LSTM network model to obtain the speaker's state feature.
Further, the speaker conversion module 34 is configured to use a first value as the speaker switch state of the current sentence when the speaker of the sentence is the same as the speaker of the previous sentence, and to use a second value as the speaker switch state of the sentence when the speaker of the sentence differs from the speaker of the previous sentence. Still further, the first value may be 1 and the second value may be 0.
FIG. 5 is a schematic structural diagram of a recurrent neural network-based emotion recognition apparatus according to the fourth embodiment of the present application. As shown in FIG. 5, the recurrent neural network-based emotion recognition apparatus 40 includes a processor 41 and a memory 42 coupled to the processor 41.
The memory 42 stores program instructions for implementing the recurrent neural network-based emotion recognition of any of the above embodiments.
The processor 41 is configured to execute the program instructions stored in the memory 42 to perform recurrent neural network-based emotion recognition.
The processor 41 may also be referred to as a CPU (Central Processing Unit). The processor 41 may be an integrated circuit chip with signal processing capabilities. The processor 41 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Refer to FIG. 6, which is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium of this embodiment stores program instructions 51 capable of implementing all of the above methods; the storage medium may be non-volatile or volatile. The program instructions 51 may be stored in the storage medium in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage device includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, or terminal devices such as computers, servers, mobile phones, and tablets.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. The above are merely embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
The above are merely embodiments of this application. It should be noted that those of ordinary skill in the art may make improvements without departing from the inventive concept of this application, and such improvements all fall within the protection scope of this application.

Claims (20)

  1. A recurrent neural network-based emotion recognition method, comprising:
    acquiring the text feature of each sentence in dialogue content;
    encoding the text feature of each sentence to obtain the context feature of each sentence;
    for each sentence, updating the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of the speaker's previous sentences;
    for each sentence, determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
    for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
  2. The recurrent neural network-based emotion recognition method according to claim 1, wherein acquiring the text feature of each sentence in the dialogue content comprises:
    segmenting each sentence in the dialogue content into words using a natural language tool to obtain the word sequence of each sentence;
    converting each word in the word sequence of the sentence into a corresponding word vector using the GloVe model; and
    inputting the word sequence of the sentence into a convolutional neural network to obtain the sentence vector of each sentence, the sentence vector being used as the text feature.
  3. The recurrent neural network-based emotion recognition method according to claim 1, wherein encoding the text feature of each sentence to obtain the context feature of each sentence comprises:
    processing the text feature of each sentence through a first long short-term memory model to obtain the context feature of each sentence.
  4. The recurrent neural network-based emotion recognition method according to claim 3, wherein processing the text feature of each sentence through the first long short-term memory model to obtain the context feature of each sentence comprises:
    acquiring the text feature of the first sentence in the dialogue content, and inputting the text feature of the first sentence into the first long short-term memory network model to obtain a first output result;
    acquiring the text feature of the second sentence adjacent to the first sentence, and inputting the first output result and the text feature of the second sentence into the first long short-term memory network model to obtain a second output result;
    for each sentence, inputting the output result of the previous sentence and the text feature of the sentence into the first long short-term memory network model to obtain the context feature of the sentence; and
    repeating the above steps until the context feature of each sentence is obtained.
  5. The recurrent neural network-based emotion recognition method according to claim 1, wherein the speaker state feature is obtained through the following steps:
    acquiring multiple sentences of the speaker in the dialogue content, inputting the text features of the multiple sentences into a second long short-term memory network model, and extracting the state feature of the speaker.
  6. The recurrent neural network-based emotion recognition method according to claim 5, wherein acquiring multiple sentences of the speaker in the dialogue content, inputting the text features of the multiple sentences into the second long short-term memory network model, and extracting the state feature of the speaker comprises:
    acquiring the text feature vector of the first sentence of the speaker, and inputting the initialization feature of the speaker and the text feature vector of the first sentence into the second long short-term memory network model to obtain a first output feature;
    acquiring the text feature vector of the second sentence adjacent to the first sentence, and inputting the first output feature and the text feature vector of the second sentence into the second long short-term memory network model to obtain a second output feature; and
    repeating the above steps until the current sentence of the speaker, and inputting the output feature of the previous round and the text feature vector of the current sentence into the second long short-term memory network model to obtain the state feature of the speaker.
  7. The recurrent neural network-based emotion recognition method according to claim 1, wherein determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence comprises:
    when the speaker of the sentence is the same as the speaker of the previous sentence, using a first value as the speaker switch state of the sentence; and
    when the speaker of the sentence is different from the speaker of the previous sentence, using a second value as the speaker switch state of the sentence;
    and wherein, after determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence, the method further comprises:
    uploading the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence to a blockchain, so that the blockchain stores the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence in encrypted form.
  8. A recurrent neural network-based emotion recognition apparatus, comprising:
    a sentence encoder configured to acquire the text feature of each sentence in dialogue content;
    a context encoder configured to encode the text feature of each sentence to obtain the context feature of each sentence;
    a speaker encoder configured, for each sentence, to update the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of the speaker's previous sentences;
    a speaker conversion module configured, for each sentence, to determine the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
    an emotion recognition module configured, for each sentence, to obtain the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
  9. A recurrent neural network-based emotion recognition apparatus, comprising a processor and a memory coupled to the processor, the memory storing program instructions executable by the processor, wherein the processor, when executing the program instructions stored in the memory, implements the following steps:
    acquiring the text feature of each sentence in dialogue content;
    encoding the text feature of each sentence to obtain the context feature of each sentence;
    for each sentence, updating the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of the speaker's previous sentences;
    for each sentence, determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
    for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
  10. The recurrent neural network-based emotion recognition apparatus according to claim 9, wherein acquiring the text feature of each sentence in the dialogue content comprises:
    segmenting each sentence in the dialogue content into words using a natural language tool to obtain the word sequence of each sentence;
    converting each word in the word sequence of the sentence into a corresponding word vector using the GloVe model; and
    inputting the word sequence of the sentence into a convolutional neural network to obtain the sentence vector of each sentence, the sentence vector being used as the text feature.
  11. The recurrent neural network-based emotion recognition apparatus according to claim 9, wherein encoding the text feature of each sentence to obtain the context feature of each sentence comprises:
    processing the text feature of each sentence through a first long short-term memory model to obtain the context feature of each sentence.
  12. The recurrent neural network-based emotion recognition apparatus according to claim 11, wherein processing the text feature of each sentence through the first long short-term memory model to obtain the context feature of each sentence comprises:
    acquiring the text feature of the first sentence in the dialogue content, and inputting the text feature of the first sentence into the first long short-term memory network model to obtain a first output result;
    acquiring the text feature of the second sentence adjacent to the first sentence, and inputting the first output result and the text feature of the second sentence into the first long short-term memory network model to obtain a second output result;
    for each sentence, inputting the output result of the previous sentence and the text feature of the sentence into the first long short-term memory network model to obtain the context feature of the sentence; and
    repeating the above steps until the context feature of each sentence is obtained.
  13. The recurrent neural network-based emotion recognition apparatus according to claim 9, wherein the speaker state feature is obtained through the following steps:
    acquiring multiple sentences of the speaker in the dialogue content, inputting the text features of the multiple sentences into a second long short-term memory network model, and extracting the state feature of the speaker.
  14. The recurrent neural network-based emotion recognition apparatus according to claim 13, wherein acquiring multiple sentences of the speaker in the dialogue content, inputting the text features of the multiple sentences into the second long short-term memory network model, and extracting the state feature of the speaker comprises:
    acquiring the text feature vector of the first sentence of the speaker, and inputting the initialization feature of the speaker and the text feature vector of the first sentence into the second long short-term memory network model to obtain a first output feature;
    acquiring the text feature vector of the second sentence adjacent to the first sentence, and inputting the first output feature and the text feature vector of the second sentence into the second long short-term memory network model to obtain a second output feature; and
    repeating the above steps until the current sentence of the speaker, and inputting the output feature of the previous round and the text feature vector of the current sentence into the second long short-term memory network model to obtain the state feature of the speaker.
  15. The recurrent neural network-based emotion recognition apparatus according to claim 9, wherein determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence comprises:
    when the speaker of the sentence is the same as the speaker of the previous sentence, using a first value as the speaker switch state of the sentence; and
    when the speaker of the sentence is different from the speaker of the previous sentence, using a second value as the speaker switch state of the sentence;
    and wherein, after determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence, the steps further comprise:
    uploading the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence to a blockchain, so that the blockchain stores the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence in encrypted form.
  16. A storage medium storing program instructions, wherein the program instructions, when executed by a processor, implement the following steps:
    acquiring the text feature of each sentence in dialogue content;
    encoding the text feature of each sentence to obtain the context feature of each sentence;
    for each sentence, updating the speaker state feature of the speaker of the sentence based on the text feature of the sentence, wherein the speaker state feature before the update was obtained from the text features of all of the speaker's previous sentences;
    for each sentence, determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence; and
    for each sentence, obtaining the emotion recognition result of the sentence based on the context feature of the sentence, the speaker state feature of the speaker of the sentence, and the speaker switch state of the sentence.
  17. The storage medium according to claim 16, wherein acquiring the text feature of each sentence in the dialogue content comprises:
    segmenting each sentence in the dialogue content into words using a natural language tool to obtain the word sequence of each sentence;
    converting each word in the word sequence of the sentence into a corresponding word vector using the GloVe model; and
    inputting the word sequence of the sentence into a convolutional neural network to obtain the sentence vector of each sentence, the sentence vector being used as the text feature.
  18. The storage medium according to claim 16, wherein encoding the text feature of each sentence to obtain the context feature of each sentence comprises:
    processing the text feature of each sentence through a first long short-term memory model to obtain the context feature of each sentence.
  19. The storage medium according to claim 16, wherein the speaker state feature is obtained through the following steps:
    acquiring multiple sentences of the speaker in the dialogue content, inputting the text features of the multiple sentences into a second long short-term memory network model, and extracting the state feature of the speaker.
  20. The storage medium according to claim 16, wherein determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence comprises:
    when the speaker of the sentence is the same as the speaker of the previous sentence, using a first value as the speaker switch state of the sentence; and
    when the speaker of the sentence is different from the speaker of the previous sentence, using a second value as the speaker switch state of the sentence;
    and wherein, after determining the speaker switch state of the sentence based on the speaker of the sentence and the speaker of the previous sentence, the steps further comprise:
    uploading the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence to a blockchain, so that the blockchain stores the text feature of the sentence, the context feature of the sentence, the speaker state feature of the sentence, and the speaker switch feature of the sentence in encrypted form.
PCT/CN2020/118498 2020-08-06 2020-09-28 Recurrent neural network-based emotion recognition method, apparatus, and storage medium WO2021135457A1 (en)

Applications Claiming Priority (2)

Application Number / Priority Date / Filing Date / Title
CN202010785309.1A (granted as CN111950275B), priority 2020-08-06, filed 2020-08-06: Emotion recognition method and device based on recurrent neural network and storage medium
CN202010785309.1, priority 2020-08-06

Publications (1)

Publication Number: WO2021135457A1 (en)



Also published as: CN111950275A (2020-11-17); CN111950275B (2023-01-17)
