WO2023078370A1 - Conversation sentiment analysis method and apparatus, and computer-readable storage medium - Google Patents

Conversation sentiment analysis method and apparatus, and computer-readable storage medium Download PDF

Info

Publication number
WO2023078370A1
Authority
WO
WIPO (PCT)
Prior art keywords
utterance
dialogue
features
node
feature
Prior art date
Application number
PCT/CN2022/129655
Other languages
French (fr)
Chinese (zh)
Inventor
夏睿
肖德斌
屠要峰
董修岗
周祥生
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2023078370A1 publication Critical patent/WO2023078370A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to but are not limited to the technical field of data processing, and in particular relate to a dialogue sentiment analysis method, device and computer-readable storage medium.
  • sentiment analysis is an important direction in the field of natural language processing.
  • however, current sentiment analysis methods for natural language are usually aimed at an individual's text expression; performing sentiment analysis through the text modality alone ignores other important features such as intonation, which distorts the analysis results;
  • in addition, sentiment analysis aimed at individuals is mainly based on time-series models, which are not suitable for sentiment analysis of multi-round conversations between multiple people and ignore the influence of the speakers on sentiment analysis.
  • in a first aspect, an embodiment of the present application provides a dialogue sentiment analysis method, the method comprising: acquiring text data and voice data of a target dialogue; and inputting the text data and the voice data into a classification model based on a graph structure to perform emotion classification and obtain emotion classification information, wherein the graph structure includes a plurality of nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
  • in a second aspect, an embodiment of the present application further provides a dialogue sentiment analysis apparatus, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue sentiment analysis method described in the first aspect.
  • in a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the dialogue sentiment analysis method described in the first aspect.
  • the embodiment of the present application includes: obtaining text data and speech data of a target dialogue; and inputting the text data and speech data into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
  • Fig. 1 is a flowchart of a dialogue sentiment analysis method according to an embodiment of the present application;
  • Fig. 2 is a detailed flowchart of step S200 in Fig. 1;
  • Fig. 3 is a detailed flowchart of step S240 in Fig. 2;
  • Fig. 4 is a detailed flowchart of step S241 in Fig. 3;
  • Fig. 5 is a structural diagram of a dialog sentiment analysis device according to an embodiment of the present application.
  • the embodiment of the present application provides a dialogue sentiment analysis method, apparatus, and computer-readable storage medium; text data and voice data are input into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, and the nodes correspond one-to-one to the utterances in the target dialogue.
  • the node features include text features of the text data and speech features of the voice data; the method can combine the bimodal text and speech information of a multi-person dialogue for emotion classification.
  • the sequence features of the text data and of the voice data are captured separately; the target dialogue is modelled with a graph structure, and a graph attention network captures the important global features, so that the classification model attends to the sequence features of the two modalities while also attending to their semantic features; this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-person dialogue.
  • An embodiment of the present application provides a dialogue sentiment analysis method.
  • FIG. 1 is a flow chart of a method for dialogue sentiment analysis.
  • the dialogue sentiment analysis method includes but is not limited to the following steps:
  • Step S100 acquiring text data and voice data of the target dialogue.
  • the target dialogue is a dialogue between multiple persons, and the target dialogue is composed of multiple sentences.
  • the text data and the voice data are extracted from the target dialogue, and the text data and the voice data are in one-to-one correspondence.
  • the text data corresponds to the entire target dialogue, and the voice data also corresponds to the entire target dialogue.
  • the voice data can be converted from the text data, and the text data can also be converted from the voice data.
  • the target dialogue may be a voice chat input by multiple people through a microphone, or a voice record, or dialogue text data recognized from chat picture records, or dialogue text data input through a keyboard, etc.
  • the voice data is preprocessed as follows: the conversation is divided into an ordered set of utterances according to the speaker's breaths or pauses, and the audio data is converted into a preset audio format; in this embodiment the preset audio format is the .wav format, although other formats may also be used in other embodiments.
  • the preprocessing method of the text data is as follows: the division of the text data and the audio data is aligned at the utterance level; characters such as stop words and symbols in the text data are removed.
  • in step S200, the text data and voice data are input into a graph-structure-based classification model for emotion classification, and emotion classification information is obtained.
  • FIG. 2 is a specific flowchart of step S200.
  • step S200 includes but is not limited to the following steps:
  • Step S210 for each utterance of the target dialogue, perform feature extraction on the text data to obtain multiple text features corresponding to the utterance, and perform feature extraction on the voice data to obtain multiple voice features corresponding to the utterance.
  • the text features of the text data include the first utterance-level features of the text data
  • the phonetic features of the voice data include the second utterance-level features of the voice data.
  • each word of the text data is mapped to a word vector to obtain a word vector matrix, and utterance-level feature extraction is performed on the word vector matrix to obtain the first utterance-level feature; for each utterance of the target dialogue, utterance-level feature extraction is also performed on the speech data to obtain a second utterance-level feature.
  • the first utterance-level feature is extracted from the text data as follows: each word of each utterance of the target dialogue in the text data is mapped to a corresponding word vector by the Global Vectors (GloVe) algorithm, and the word vectors form a word vector matrix; the word vector matrix is input to a convolutional neural network for utterance-level feature extraction, and the convolutional neural network outputs the first utterance-level feature.
  • the convolutional neural network has three convolution kernels. The dimensions of the three convolution kernels are 3, 4, and 5 respectively. Each convolution kernel corresponds to 50 output channels, and the results of the convolution calculation are maximally pooled.
  • the result of the maximum pooling layer passes through the fully connected layer to obtain the first utterance-level features of fixed dimensions.
  • during training of the convolutional neural network, the emotion category label of each utterance is used as the training label, cross entropy is used as the loss function, and the network is trained through the back-propagation algorithm to update and fine-tune the network parameters.
  • feature extraction networks with other structures may also be used to extract utterance-level features from text data.
  • the second utterance-level feature is extracted from the speech data as follows: a configuration file, such as IS09_emotion.conf, is selected in the openSMILE software; the speech data is then input to openSMILE, which outputs a second utterance-level feature of fixed dimension; the second utterance-level feature includes the Mel cepstral coefficients, frequencies, and other attributes of the speech data.
  • of course, in other embodiments, other methods may also be used to extract utterance-level features from the speech data.
  • the text feature of the text data also includes the first hidden layer state of the text data
  • the speech feature of the speech data also includes the second hidden layer state of the speech data.
  • for each utterance of the target dialogue, the first utterance-level features are input in time order into an encoding network to obtain the first hidden layer state corresponding to the text data; likewise, the second utterance-level features are input in time order into an encoding network to obtain the second hidden layer state corresponding to the speech data.
  • the first utterance-level feature and the second utterance-level feature constitute a bimodal utterance-level feature $u_i = [u_i^T, u_i^A]$, where $u_i^T$ denotes the first utterance-level feature of the i-th utterance in dialogue d and $u_i^A$ denotes the second utterance-level feature of the i-th utterance in d.
  • the text data is encoded to obtain the first hidden layer state as follows: the first utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^T = [\overrightarrow{h}_i^T ; \overleftarrow{h}_i^T]$ to obtain the first hidden layer state corresponding to the text data, where $\overrightarrow{h}_i^T$ is the forward representation of the text data in the bidirectional LSTM network, $\overleftarrow{h}_i^T$ is the backward representation, and $h_i^T$ is the first hidden layer state obtained by concatenating the forward and backward representations.
  • the speech data is encoded to obtain the second hidden layer state in the same way: the second utterance-level features are fed in time order as the input of each time step of an LSTM network, which encodes them according to $h_i^A = [\overrightarrow{h}_i^A ; \overleftarrow{h}_i^A]$ to obtain the second hidden layer state corresponding to the speech data, where $\overrightarrow{h}_i^A$ and $\overleftarrow{h}_i^A$ are the forward and backward representations of the speech data in the bidirectional LSTM network and $h_i^A$ is their concatenation.
  • the text feature of the text data also includes a first predicted probability distribution of the text data
  • the voice feature of the voice data further includes a second predicted probability distribution of the voice data.
  • the first hidden layer state is classified to obtain the first predicted probability distribution of the text data as follows: the first hidden layer state is input to a softmax layer, and the first predicted probability distribution is expressed as $P_i^T = \mathrm{softmax}(W_T h_i^T + b_T)$, where $P_i^T$ denotes the first predicted probability distribution of the text data of the i-th utterance in the target dialogue, $W_T$ is the weight, and $b_T$ is the configuration parameter.
  • a cross-entropy loss is computed during the emotion classification of the text data, $\mathrm{loss}_T = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^T$, where $\mathrm{loss}_T$ is the prediction loss of the text data, $y_{i,c}$ is the ground-truth label of the i-th utterance for category c, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • the second hidden layer state is input into the softmax layer, and the second predicted probability distribution is expressed as $P_i^A = \mathrm{softmax}(W_A h_i^A + b_A)$, where $P_i^A$ denotes the second predicted probability distribution of the speech data of the i-th utterance in the target dialogue, $W_A$ is the weight, and $b_A$ is the configuration parameter.
  • a cross-entropy loss is likewise computed during the emotion classification of the speech data, $\mathrm{loss}_A = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^A$, where $\mathrm{loss}_A$ is the prediction loss of the speech data, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • that is, in this embodiment, the multiple text features of the text data include the first utterance-level feature, the first hidden layer state, and the first predicted probability distribution of the text data; the multiple voice features of the voice data include the second utterance-level feature, the second hidden layer state, and the second predicted probability distribution.
  • other types of features may also be included.
  • Step S220 for each utterance of the target dialogue, perform feature fusion on multiple text features and multiple voice features to obtain fused features corresponding to the utterance.
  • in step S220, in an embodiment, for each utterance of the target dialogue, the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced into one vector to obtain the fused feature corresponding to the utterance.
  • the first utterance-level features of the text data and the second utterance-level features of the speech data of the same speaker in the target dialogue are initialized to a fixed-dimensional vector, that is, speaker features.
  • step S230 a graph structure is constructed with utterances as nodes and fusion features as node features.
  • step S230 includes but is not limited to the following steps: taking each utterance of the target dialogue as a node of the graph structure; and taking the fused feature corresponding to the utterance as the node feature of that node;
  • since the information of each utterance in the target dialogue is related to the global information, the nodes are connected pairwise to obtain a complete undirected graph as the graph structure, and one complete undirected graph corresponds to one dialogue.
  • Step S240 perform emotion classification on the target dialogue according to the graph structure, and obtain emotion classification information.
  • FIG. 3 is a specific flowchart of step S240.
  • step S240 includes but is not limited to the following steps:
  • Step S241 update the node features of the graph structure based on the attention mechanism, and obtain new node features fused with global feature information.
  • FIG. 4 is a specific flowchart of step S241.
  • step S241 includes but is not limited to the following steps:
  • Step S2411 perform linear mapping on all nodes of the graph structure to obtain new nodes.
  • each time the graph structure is mapped, one layer of the network is obtained; the entire graph attention network can be a multi-layer network structure.
  • the linear mapping can be expressed as $z_i^{(l)} = W^{(l)} h_i^{(l)}$, where $h_i^{(l)}$ is the input of the linear mapping of the l-th layer of the graph attention network, $W^{(l)}$ is the weight of the linear mapping of the l-th layer, and $z_i^{(l)}$ is the output of the linear mapping of the l-th layer.
  • Step S2412: determine a central node from the new nodes, and determine the adjacent nodes of the central node; it should be noted that, in this embodiment, each new node is taken in turn as the central node, and the adjacent nodes of the central node are the nodes directly connected to it.
  • Step S2413: respectively calculate the first attention weight from the central node to each adjacent node; the node features of the two nodes are spliced into one vector, the spliced vector is multiplied (dot product) with a learnable weight vector, and the result of the dot product is passed through the LeakyReLU activation function for nonlinear activation to obtain the attention score between the two nodes.
  • the attention score can be expressed as $e_{ij}^{(l)} = \mathrm{LeakyReLU}\big(a^{(l)\top} [z_i^{(l)} \,\|\, z_j^{(l)}]\big)$, where $e_{ij}^{(l)}$ is the attention score from the i-th node to the j-th node in the l-th layer of the graph attention network, $a^{(l)}$ is the learnable weight vector of the l-th layer, $z_i^{(l)}$ is the i-th node of the l-th layer, and $z_j^{(l)}$ is the j-th node of the l-th layer; the attention scores from the central node to each adjacent node are calculated in this way, and the first attention weights are obtained by normalizing the attention scores according to $\alpha_{ij}^{(l)} = \frac{\exp\!\big(e_{ij}^{(l)}\big)}{\sum_{k \in N_i} \exp\!\big(e_{ik}^{(l)}\big)}$, where $\alpha_{ij}^{(l)}$ is the first attention weight from the i-th node to the j-th node in the l-th layer, and $N_i$ is the set of all adjacent nodes.
  • Step S2414: weight and sum over all the first attention weights to obtain a second attention weight, and apply the ReLU activation function to the second attention weight for nonlinear activation to obtain the new node feature fused with global feature information;
  • the node feature update can be expressed as $h_i^{(l+1)} = \mathrm{ReLU}\Big(\sum_{j \in N_i} \alpha_{ij}^{(l)} z_j^{(l)}\Big)$, where $h_i^{(l+1)}$ is the new node feature and $\sum_{j \in N_i} \alpha_{ij}^{(l)} z_j^{(l)}$ is the second attention weight.
  • the new node features are used as the output of the current layer network.
  • the graph attention network obtained by updating the node features of the graph structure based on the attention mechanism has a multi-layer structure, and the input of each layer is the output of the previous layer.
  • Step S242 perform emotion classification on the target dialogue according to the new node features, and obtain emotion classification information.
  • the classifier adopts a softmax classifier; of course, other types of classifiers may also be used in other embodiments.
  • the softmax classifier predicts the probability distribution over emotion categories, where $P_i$ is the predicted probability distribution of the i-th utterance in the target dialogue;
  • the category corresponding to the maximum probability in $P_i$ is the emotion classification information.
  • the loss of the main emotion classification task is calculated with the cross-entropy loss, and the prediction loss of the main task is expressed as $\mathrm{loss}_B = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}$, where $\mathrm{loss}_B$ is the prediction loss of the main task, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
  • the bimodal text and speech information of the multi-person dialogue is used for emotion classification, the sequence features of the two modalities are captured separately, and the two modalities are independently predicted and compared.
  • the embodiment of the present application also provides a dialog sentiment analysis device.
  • FIG. 5 is a structural diagram of a dialogue emotion analysis device.
  • the dialog sentiment analysis device includes: a memory 20 , a processor 10 and a computer program stored in the memory 20 and operable on the processor 10 .
  • the processor 10 executes the computer program, the above dialog sentiment analysis method is realized.
  • the processor 10 and the memory 20 may be connected through a bus 30 or other means.
  • the memory 20 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
  • the memory 20 includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the non-transitory software programs and instructions required to implement the information processing method of the above embodiment are stored in the memory 20; when executed by the processor, the dialogue sentiment analysis method of the above embodiment is performed, for example, steps S100 to S200, steps S210 to S240, steps S241 to S242, and steps S2411 to S2414 described above are executed.
  • the apparatus embodiments described above are only illustrative, and the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • an embodiment of the present application also provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being executed by a processor or a controller; for example, execution by a processor may cause the processor to perform the dialogue sentiment analysis method in the above embodiment, for example, to execute steps S100 to S200, S210 to S240, S241 to S242, and S2411 to S2414 described above.
  • the embodiment of the present application can combine the bimodal text and speech information of a multi-person dialogue for emotion classification and capture the respective sequence features of the text data and voice data; the graph structure is used to model the target dialogue,
  • and the graph attention network is used to capture important global features, so that the classification model attends not only to the sequence features of the two modalities but also to their semantic features; for multi-person dialogue, this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

Provided in the embodiments of the present application are a conversation sentiment analysis method and apparatus, and a computer-readable storage medium. The method comprises: acquiring text data and speech data of a target conversation (S100); and inputting the text data and the speech data into a classification model based on a graph structure to perform sentiment classification by means of the classification model, wherein the graph structure comprises a plurality of nodes and node features, each node corresponds to an utterance, and each node feature comprises a text feature and a speech feature (S200).

Description

Dialogue sentiment analysis method, apparatus and computer-readable storage medium
Cross-Reference to Related Applications
This application is based on, and claims priority to, the Chinese patent application with application number 202111295920.7 filed on November 3, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to, but are not limited to, the technical field of data processing, and in particular to a dialogue sentiment analysis method, apparatus and computer-readable storage medium.
Background
On the Internet, users are willing to share their experiences and express their opinions through text and voice, and sentiment analysis is therefore an important direction in the field of natural language processing. However, current sentiment analysis methods for natural language are usually aimed at an individual's text expression; performing sentiment analysis through the text modality alone ignores other important features such as intonation and distorts the analysis results. In addition, sentiment analysis aimed at individuals is mainly based on time-series models, which are not suitable for sentiment analysis of multi-round conversations between multiple people and ignore the influence of the speakers on sentiment analysis.
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
In a first aspect, an embodiment of the present application provides a dialogue sentiment analysis method, the method comprising: acquiring text data and voice data of a target dialogue; and inputting the text data and the voice data into a classification model based on a graph structure to perform emotion classification and obtain emotion classification information, wherein the graph structure includes a plurality of nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
In a second aspect, an embodiment of the present application further provides a dialogue sentiment analysis apparatus, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue sentiment analysis method described in the first aspect.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the dialogue sentiment analysis method described in the first aspect.
The embodiment of the present application includes: obtaining text data and speech data of a target dialogue; and inputting the text data and speech data into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data.
Additional features and advantages of the present application will be set forth in the description which follows and will in part be apparent from the description or be learned by practice of the present application. The objectives and other advantages of the present application can be realized and attained by the structures particularly pointed out in the description, the claims, and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the technical solution of the present application and constitute a part of the specification; together with the embodiments of the present application, they serve to explain the technical solution of the present application and do not limit it.
Fig. 1 is a flowchart of a dialogue sentiment analysis method according to an embodiment of the present application;
Fig. 2 is a detailed flowchart of step S200 in Fig. 1;
Fig. 3 is a detailed flowchart of step S240 in Fig. 2;
Fig. 4 is a detailed flowchart of step S241 in Fig. 3;
Fig. 5 is a structural diagram of a dialogue sentiment analysis apparatus according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solution and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
It should be noted that although functional modules are divided in the schematic diagram of the apparatus and a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed with a module division different from that in the apparatus or in an order different from that in the flowcharts. The terms "first", "second" and the like in the specification, the claims, and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
The embodiment of the present application provides a dialogue sentiment analysis method, apparatus, and computer-readable storage medium; text data and voice data are input into a graph-structure-based classification model for emotion classification to obtain emotion classification information, wherein the graph structure includes multiple nodes and corresponding node features, the nodes correspond one-to-one to the utterances in the target dialogue, and the node features include text features of the text data and speech features of the voice data. The method can combine the bimodal text and speech information of a multi-person dialogue for emotion classification and capture the sequence features of the two modalities separately; the target dialogue is modelled with a graph structure, and a graph attention network captures the important global features, so that the classification model attends to the sequence features of the two modalities while also attending to their semantic features. This strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-person dialogue.
The embodiments of the present application are further described below with reference to the accompanying drawings.
An embodiment of the present application provides a dialogue sentiment analysis method.
Referring to Fig. 1, Fig. 1 is a flowchart of the dialogue sentiment analysis method. The dialogue sentiment analysis method includes but is not limited to the following steps:
Step S100: acquiring text data and voice data of the target dialogue.
Regarding step S100, the target dialogue is acquired. It should be noted that the target dialogue is a dialogue between multiple persons and is composed of multiple utterances.
The text data and the voice data are extracted from the target dialogue and correspond to each other one-to-one. The text data corresponds to the entire target dialogue, and the voice data likewise corresponds to the entire target dialogue; the voice data can be converted from the text data, and the text data can also be converted from the voice data.
It should be noted that the target dialogue may be a voice chat input by multiple people through microphones, a voice recording, dialogue text recognized from chat screenshots, dialogue text input through a keyboard, and so on.
In addition, after the text data and voice data of the target dialogue are acquired, they need to be preprocessed to facilitate the subsequent emotion classification.
The voice data is preprocessed as follows: the conversation is divided into an ordered set of utterances according to the speaker's breaths or pauses, and the audio data is converted into a preset audio format; in this embodiment the preset audio format is the .wav format, although other formats may also be used in other embodiments.
The text data is preprocessed as follows: the division of the text data is aligned with that of the audio data at the utterance level, and stop words, symbols, and other such characters are removed from the text data.
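As a non-limiting illustration of this preprocessing, the sketch below segments a recording on pauses and cleans the aligned transcripts; pydub is only one possible tool, and the silence thresholds, file names and stop-word list are assumptions rather than values taken from the application.
```python
import re
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Hypothetical stop-word list; a real system would use a full list for the target language.
STOP_WORDS = {"the", "a", "an", "uh", "um"}

def preprocess_dialogue(audio_path, transcripts):
    """Split a dialogue recording into utterance-level .wav clips (an ordered set of
    utterances) and clean the transcript aligned with each utterance."""
    audio = AudioSegment.from_file(audio_path)
    # Segment on speaker pauses; the thresholds below are illustrative only.
    clips = split_on_silence(audio, min_silence_len=500, silence_thresh=-40)
    utterances = []
    for i, (clip, text) in enumerate(zip(clips, transcripts)):
        wav_name = f"utterance_{i}.wav"
        clip.export(wav_name, format="wav")           # preset .wav audio format
        tokens = re.findall(r"[\w']+", text.lower())  # drop symbols
        cleaned = [t for t in tokens if t not in STOP_WORDS]
        utterances.append({"wav": wav_name, "tokens": cleaned})
    return utterances
```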
Step S200: inputting the text data and voice data into a graph-structure-based classification model for emotion classification to obtain emotion classification information.
Referring to Fig. 2, Fig. 2 is a detailed flowchart of step S200. Step S200 includes but is not limited to the following steps:
Step S210: for each utterance of the target dialogue, performing feature extraction on the text data to obtain multiple text features corresponding to the utterance, and performing feature extraction on the voice data to obtain multiple voice features corresponding to the utterance.
Regarding step S210, the text features of the text data include a first utterance-level feature of the text data, and the voice features of the voice data include a second utterance-level feature of the voice data. For each utterance of the target dialogue, each word of the text data is mapped to a word vector to obtain a word vector matrix, and utterance-level feature extraction is performed on the word vector matrix to obtain the first utterance-level feature; utterance-level feature extraction is also performed on the speech data to obtain the second utterance-level feature.
In an embodiment, the first utterance-level feature is extracted from the text data as follows: each word of each utterance of the target dialogue in the text data is mapped to a corresponding word vector by the Global Vectors (GloVe) algorithm, and the word vectors form a word vector matrix; the word vector matrix is input to a convolutional neural network for utterance-level feature extraction, and the convolutional neural network outputs the first utterance-level feature. The convolutional neural network has three kinds of convolution kernels, of sizes 3, 4 and 5, each corresponding to 50 output channels; the convolution results pass through a max-pooling layer and then a fully connected layer to obtain a first utterance-level feature of fixed dimension. During training, the emotion category label of each utterance is used as the training label, cross entropy is used as the loss function, and the convolutional neural network is trained through the back-propagation algorithm to update and fine-tune the network parameters. Of course, in other embodiments, feature extraction networks of other structures may also be used to extract utterance-level features from the text data.
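A minimal PyTorch sketch of such a text encoder (word-vector matrix in, fixed-size utterance feature out), using the kernel sizes 3, 4 and 5 with 50 channels each described above; the embedding and output dimensions are assumptions.
```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Utterance-level text feature extractor: GloVe word-vector matrix -> fixed-size feature."""
    def __init__(self, emb_dim=300, out_dim=100, kernel_sizes=(3, 4, 5), channels=50):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(emb_dim, channels, k) for k in kernel_sizes)
        self.fc = nn.Linear(channels * len(kernel_sizes), out_dim)

    def forward(self, word_vectors):                 # (batch, seq_len, emb_dim)
        x = word_vectors.transpose(1, 2)             # Conv1d expects (batch, emb_dim, seq_len)
        # Convolution, then max pooling over the time dimension for each kernel size
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # first utterance-level feature
```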
The second utterance-level feature is extracted from the speech data as follows: a configuration file, such as IS09_emotion.conf, is selected in the openSMILE software; the speech data is then input to openSMILE, and openSMILE outputs a second utterance-level feature of fixed dimension, which includes the Mel cepstral coefficients, frequencies and other attributes of the speech data. Of course, in other embodiments, other methods may also be used to extract utterance-level features from the speech data.
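The acoustic features might be extracted by calling the openSMILE command-line tool roughly as below; the paths to the SMILExtract binary and to the IS09_emotion.conf configuration depend on the local installation and are assumptions here.
```python
import subprocess

def extract_is09_features(wav_path, out_path="features.arff"):
    """Run openSMILE on one utterance; the fixed-dimensional second utterance-level
    feature (MFCC-based functionals, pitch, energy, etc.) is written to out_path."""
    subprocess.run(
        ["SMILExtract", "-C", "IS09_emotion.conf", "-I", wav_path, "-O", out_path],
        check=True,
    )
    return out_path
```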
In addition, the text features of the text data also include a first hidden layer state of the text data, and the voice features of the voice data also include a second hidden layer state of the voice data. For each utterance of the target dialogue, the first utterance-level features are input in time order into an encoding network for encoding to obtain the first hidden layer state corresponding to the text data; for each utterance of the target dialogue, the second utterance-level features are input in time order into an encoding network for encoding to obtain the second hidden layer state corresponding to the voice data.
The first utterance-level feature and the second utterance-level feature constitute a bimodal utterance-level feature $u_i = [u_i^T, u_i^A]$, where $u_i^T$ denotes the first utterance-level feature of the i-th utterance in dialogue d and $u_i^A$ denotes the second utterance-level feature of the i-th utterance in d.
In an embodiment, the text data is encoded to obtain the first hidden layer state as follows: the first utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^T = [\overrightarrow{h}_i^T ; \overleftarrow{h}_i^T]$ to obtain the first hidden layer state corresponding to the text data, where $\overrightarrow{h}_i^T$ is the forward representation of the text data in the bidirectional LSTM network, $\overleftarrow{h}_i^T$ is the backward representation of the text data in the bidirectional LSTM network, and $h_i^T$ is the first hidden layer state obtained by concatenating the forward and backward representations.
The speech data is encoded to obtain the second hidden layer state as follows: the second utterance-level features are fed in time order as the input of each time step of a Long Short-Term Memory (LSTM) network, and the LSTM network encodes them according to $h_i^A = [\overrightarrow{h}_i^A ; \overleftarrow{h}_i^A]$ to obtain the second hidden layer state corresponding to the speech data, where $\overrightarrow{h}_i^A$ is the forward representation of the speech data in the bidirectional LSTM network, $\overleftarrow{h}_i^A$ is the backward representation of the speech data in the bidirectional LSTM network, and $h_i^A$ is the second hidden layer state obtained by concatenating the forward and backward representations.
In addition, the text features of the text data also include a first predicted probability distribution of the text data, and the voice features of the voice data also include a second predicted probability distribution of the voice data. For each utterance of the target dialogue, probability prediction is performed according to the first hidden layer state to obtain the first predicted probability distribution corresponding to the text data; for each utterance of the target dialogue, probability prediction is performed according to the second hidden layer state to obtain the second predicted probability distribution corresponding to the voice data.
In an embodiment, the first hidden layer state is classified to obtain the first predicted probability distribution of the text data as follows: the first hidden layer state is input to a softmax layer, and the first predicted probability distribution is expressed as $P_i^T = \mathrm{softmax}(W_T h_i^T + b_T)$, where $P_i^T$ denotes the first predicted probability distribution of the text data of the i-th utterance in the target dialogue, $W_T$ is the weight, and $b_T$ is the configuration parameter. A cross-entropy loss is computed during the emotion classification of the text data and is expressed as $\mathrm{loss}_T = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^T$, where $\mathrm{loss}_T$ is the prediction loss of the text data, $y_{i,c}$ is the ground-truth label of the i-th utterance for category c, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
The second hidden layer state is classified to obtain the second predicted probability distribution of the speech data as follows: the second hidden layer state is input to the softmax layer, and the second predicted probability distribution is expressed as $P_i^A = \mathrm{softmax}(W_A h_i^A + b_A)$, where $P_i^A$ denotes the second predicted probability distribution of the speech data of the i-th utterance in the target dialogue, $W_A$ is the weight, and $b_A$ is the configuration parameter. A cross-entropy loss is likewise computed during the emotion classification of the speech data and is expressed as $\mathrm{loss}_A = -\sum_{d \in Corpus} \sum_{i=1}^{I} \sum_{c=1}^{C} y_{i,c} \log P_{i,c}^A$, where $\mathrm{loss}_A$ is the prediction loss of the speech data, Corpus represents all the dialogues in the dataset, I represents the number of utterances in the dialogue, and C is the dimension of the output vector, i.e., the number of emotion categories.
That is, in this embodiment, the multiple text features of the text data include the first utterance-level feature, the first hidden layer state, and the first predicted probability distribution of the text data; the multiple voice features of the voice data include the second utterance-level feature, the second hidden layer state, and the second predicted probability distribution. Of course, in other embodiments, other kinds of features may also be included.
Step S220: for each utterance of the target dialogue, performing feature fusion on the multiple text features and the multiple voice features to obtain a fused feature corresponding to the utterance.
Regarding step S220, in an embodiment, for each utterance of the target dialogue, the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced into one vector to obtain the fused feature corresponding to the utterance.
The first utterance-level features of the text data and the second utterance-level features of the speech data of the same speaker in the target dialogue are initialized into a fixed-dimensional vector, namely the speaker feature. The speaker features are arranged according to the speaking order to obtain the sequence $s = [\ldots, s_i, \ldots]$, where $s_i$ denotes the speaker feature corresponding to the i-th utterance in the target dialogue.
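One possible realization of the speaker features is a small learnable embedding table indexed by speaker identity, as sketched below; the number of speakers, the embedding dimension, and the use of nn.Embedding are assumptions made for illustration.
```python
import torch
import torch.nn as nn

num_speakers, speaker_dim = 2, 32                  # assumed values
speaker_embedding = nn.Embedding(num_speakers, speaker_dim)

# speaker_ids[i] is the speaker of the i-th utterance, in speaking order
speaker_ids = torch.tensor([0, 1, 0, 1, 1])
s = speaker_embedding(speaker_ids)                 # sequence s = [..., s_i, ...], one row per utterance
```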
The speaker feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution, and the second predicted probability distribution are spliced along the feature dimension, expressed as $h_i = [s_i; h_i^T; h_i^A; P_i^T; P_i^A]$, where $h_i$ denotes the fused feature corresponding to the i-th utterance in the target dialogue. In terms of the fused features, the dialogue can then be expressed as $d = [h_1, \ldots, h_i, \ldots]$.
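A minimal sketch of this fusion step: for each utterance, the speaker feature, the two hidden layer states and the two predicted probability distributions are concatenated along the feature dimension; the argument order mirrors the description above.
```python
import torch

def fuse_utterance(s_i, h_text_i, h_audio_i, p_text_i, p_audio_i):
    """Vector splicing of the per-utterance features into the fused node feature h_i."""
    return torch.cat([s_i, h_text_i, h_audio_i, p_text_i, p_audio_i], dim=-1)

# Applied to every utterance, a dialogue becomes the sequence d = [h_1, ..., h_i, ...]
```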
Step S230: constructing a graph structure with the utterances as nodes and the fused features as node features.
Step S230 includes but is not limited to the following steps: taking each utterance of the target dialogue as a node of the graph structure; taking the fused feature corresponding to the utterance as the node feature of that node; and, since the information of each utterance in the target dialogue is related to the global information, connecting the nodes pairwise to obtain a complete undirected graph as the graph structure, where one complete undirected graph corresponds to one dialogue.
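The complete undirected graph over the utterances of one dialogue might be represented with an edge list as below; the tensor-based representation and the exclusion of self-loops are assumptions made for illustration.
```python
import torch

def build_complete_graph(node_features):
    """node_features: (num_utterances, feat_dim) fused features h_i.
    Returns the node feature matrix and the edge index of a complete undirected graph."""
    n = node_features.size(0)
    edges = [(i, j) for i in range(n) for j in range(n) if i != j]   # every pair of nodes
    edge_index = torch.tensor(edges, dtype=torch.long).t()           # shape (2, n * (n - 1))
    return node_features, edge_index
```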
Step S240: performing emotion classification on the target dialogue according to the graph structure to obtain emotion classification information.
Referring to Fig. 3, Fig. 3 is a detailed flowchart of step S240. Step S240 includes but is not limited to the following steps:
Step S241: updating the node features of the graph structure based on an attention mechanism to obtain new node features fused with global feature information.
Referring to Fig. 4, Fig. 4 is a detailed flowchart of step S241. Step S241 includes but is not limited to the following steps:
Step S2411: performing a linear mapping on all nodes of the graph structure to obtain new nodes. Each time the graph structure is mapped, one layer of the network is obtained, and the entire graph attention network can be a multi-layer network structure. The linear mapping can be expressed as $z_i^{(l)} = W^{(l)} h_i^{(l)}$, where $h_i^{(l)}$ is the input of the linear mapping of the l-th layer of the graph attention network, $W^{(l)}$ is the weight of the linear mapping of the l-th layer, and $z_i^{(l)}$ is the output of the linear mapping of the l-th layer.
Step S2412: determining a central node from the new nodes and determining the adjacent nodes of the central node. It should be noted that, in this embodiment, each new node is taken in turn as the central node, and the adjacent nodes of the central node are the nodes directly connected to it.
Step S2413: compute the first attention weight from the central node to each adjacent node. The node features of the two nodes are concatenated into one vector, the concatenated vector is dot-multiplied with a learnable weight vector, and the result of the dot product is passed through a LeakyReLU activation function for non-linear activation to obtain the attention score between the two nodes. The attention score can be expressed as

e_ij^(l) = LeakyReLU(a^(l) · [g_i^(l) ; g_j^(l)])

where e_ij^(l) is the attention score from the i-th node to the j-th node at the l-th layer of the graph attention network, a^(l) is the learnable weight vector of the l-th layer, g_i^(l) is the i-th node of the l-th layer, and g_j^(l) is the j-th node of the l-th layer. The attention scores from the central node to each adjacent node are computed in this way. The first attention weight from the central node to each adjacent node is then obtained by normalizing the attention scores, i.e.

α_ij^(l) = exp(e_ij^(l)) / Σ_{k ∈ N_i} exp(e_ik^(l))

where α_ij^(l) is the first attention weight from the i-th node to the j-th node at the l-th layer, and N_i is the set of all adjacent nodes.
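Under the same assumed shapes, the attention scoring and its normalization into the first attention weights can be sketched as below; since the dialogue graph is a complete undirected graph, the softmax here runs over all nodes, which is an assumption about how self-connections are handled:

```python
import torch
import torch.nn.functional as F

# Sketch of step S2413: concatenate the features of the central node and each
# adjacent node, dot-multiply with a learnable vector a^(l), apply LeakyReLU to get
# the attention scores e_ij, then normalize with softmax to get alpha_ij.
N, d = 5, 32
g = torch.randn(N, d)          # mapped node features g_i^(l)
a = torch.randn(2 * d)         # learnable weight vector a^(l)
pairs = torch.cat([g.unsqueeze(1).expand(N, N, d),
                   g.unsqueeze(0).expand(N, N, d)], dim=-1)   # [g_i ; g_j] for every pair
scores = F.leaky_relu(pairs @ a)             # attention scores e_ij^(l)
alpha = torch.softmax(scores, dim=-1)        # first attention weights alpha_ij^(l)
```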
Step S2414: take a weighted sum using all the first attention weights to obtain the second attention weight, and pass the second attention weight through a ReLU activation function for non-linear activation to obtain the new node features that fuse global feature information. This node-feature update can be expressed as

h_i^(l+1) = ReLU( Σ_{j ∈ N_i} α_ij^(l) g_j^(l) )

where h_i^(l+1) is the new node feature and Σ_{j ∈ N_i} α_ij^(l) g_j^(l) is the second attention weight. The new node features are also used as the output of the current layer.
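The weighted aggregation and ReLU activation of this step can be sketched as follows, with placeholder inputs standing in for the quantities computed in the previous steps:

```python
import torch
import torch.nn.functional as F

# Sketch of step S2414: each neighbor's mapped feature g_j is weighted by the first
# attention weight alpha_ij, the weighted sum (the "second attention weight" in the
# text) is passed through ReLU, and the result is the updated node feature that
# becomes the output of the current layer.
N, d = 5, 32
g = torch.randn(N, d)                                # mapped node features g_j^(l)
alpha = torch.softmax(torch.randn(N, N), dim=-1)     # stand-in first attention weights
h_new = F.relu(alpha @ g)                            # new node features h_i^(l+1)
```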
In addition, the graph attention network obtained by updating the node features of the graph structure based on the attention mechanism has a multi-layer structure, and the input of each layer is the output of the previous layer; that is, the new node features h_i^(l+1) output by layer l serve directly as the input node features of layer l+1.
Step S242: perform emotion classification on the target dialogue according to the new node features to obtain the emotion classification information.
For step S242, the output h_i^(2) of the graph attention network with a two-layer structure is taken as the input of the classifier. In this embodiment the classifier is a softmax classifier; other types of classifiers may of course be used in other embodiments. The softmax classifier predicts the probability distribution over emotion categories for h_i^(2), i.e.

ŷ_i = softmax(W_c h_i^(2) + b_c)

where ŷ_i is the predicted probability distribution of the i-th utterance in the target dialogue, and W_c and b_c are the learnable parameters of the softmax classifier. The category corresponding to the maximum probability of ŷ_i is the emotion classification information.
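A sketch of such a classification head follows; the linear layer is an assumption, since the text only specifies a softmax classifier over the graph attention network output:

```python
import torch
import torch.nn as nn

# Sketch of step S242: the two-layer GAT output for each utterance is mapped to C
# emotion classes, softmax turns the logits into a probability distribution, and the
# class with the maximum probability is the emotion classification information.
N, d, C = 5, 32, 6                           # hypothetical sizes
h_out = torch.randn(N, d)                    # GAT outputs h_i^(2), one row per utterance
classifier = nn.Linear(d, C)                 # softmax classifier parameters (assumed)
probs = torch.softmax(classifier(h_out), dim=-1)   # predicted distributions y_hat_i
pred = probs.argmax(dim=-1)                  # emotion classification information
```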
For step S242, the loss of the main emotion-classification task is computed with the cross-entropy loss, and the prediction loss of the main task is expressed as

loss_B = − Σ_{d ∈ Corpus} Σ_{i=1}^{I} Σ_{c=1}^{C} y_{i,c} log(ŷ_{i,c})

where loss_B is the prediction loss of the main task, Corpus denotes all dialogues in the data set, I denotes the number of utterances in a dialogue, C is the dimension of the output vector, i.e. the number of emotion categories, y_{i,c} is the ground-truth label of the i-th utterance for category c, and ŷ_{i,c} is the corresponding predicted probability.
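With made-up shapes and labels, the cross-entropy computation for a single dialogue can be sketched as:

```python
import torch
import torch.nn.functional as F

# Sketch of the main-task loss: cross-entropy between the predicted per-utterance
# distributions and the gold emotion labels, summed over the utterances of one
# dialogue (and, during training, over all dialogues in Corpus).
logits = torch.randn(5, 6)              # classifier logits for I = 5 utterances, C = 6 classes
gold = torch.tensor([0, 2, 1, 5, 3])    # gold emotion labels
loss_B = F.cross_entropy(logits, gold, reduction='sum')
```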
For the whole classification model, the final loss is the weighted sum of the prediction loss of the text data, the prediction loss of the speech data, the prediction loss of the main task and a regularization term:

loss_all = λ_T · loss_T + λ_A · loss_A + λ_B · loss_B + λ_r · ‖θ‖²

where loss_all is the final loss, loss_T is the prediction loss of the text data, loss_A is the prediction loss of the speech data, loss_B is the prediction loss of the main task, λ_T, λ_A, λ_B and λ_r are the weights of loss_T, loss_A, loss_B and the L2 regularization term ‖θ‖², respectively, and θ denotes the set of all adjustable parameters in the classification model. The parameters of the whole classification model are fine-tuned by means of the final loss, making the emotion classification of the model more accurate.
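For illustration, the weighted combination could be computed as below; the loss values, weights and parameter stand-ins are all hypothetical:

```python
import torch

# Sketch of the final loss: a weighted sum of the text-modality loss, the
# speech-modality loss, the main-task loss and an L2 regularization term over all
# adjustable parameters theta.
loss_T = torch.tensor(0.7)              # example auxiliary loss on text data
loss_A = torch.tensor(0.9)              # example auxiliary loss on speech data
loss_B = torch.tensor(1.2)              # example main-task loss
lam_T, lam_A, lam_B, lam_r = 0.3, 0.3, 1.0, 1e-4   # hypothetical weights
params = [torch.randn(10, requires_grad=True)]      # stand-in for model parameters theta
l2 = sum((p ** 2).sum() for p in params)            # L2 regularization ||theta||^2
loss_all = lam_T * loss_T + lam_A * loss_A + lam_B * loss_B + lam_r * l2
```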
In this embodiment, emotion classification is performed by combining the bimodal information of the text data and the speech data of a multi-party dialogue: the sequential features of the text modality and of the speech modality are captured separately, and independent emotion prediction on each modality serves as an auxiliary task to strengthen the feature representation of the two modalities. The target dialogue is modeled with a graph structure, and a graph attention network captures globally important features, so that the classification model attends to the semantic features of the two modalities as well as to their sequential features. This strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification for multi-party dialogue.
In addition, an embodiment of the present application further provides a dialogue emotion analysis apparatus.
Referring to FIG. 5, which is a structural diagram of the dialogue emotion analysis apparatus, the apparatus includes a memory 20, a processor 10 and a computer program that is stored in the memory 20 and executable on the processor 10. When the processor 10 executes the computer program, the dialogue emotion analysis method described above is implemented.
The processor 10 and the memory 20 may be connected by a bus 30 or in other ways.
As a non-transitory computer-readable storage medium, the memory 20 may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory 20 may include high-speed random access memory and may also include non-transitory memory, for example at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some implementations, the memory 20 includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The non-transitory software programs and instructions required to implement the information processing method of the above embodiments are stored in the memory 20; when executed by the processor, they carry out the dialogue emotion analysis method of the above embodiments, for example steps S100 to S200, steps S210 to S240, steps S241 to S242 and steps S2411 to S2414 described above.
The embodiments described above are merely illustrative; units described as separate components may or may not be physically separated, that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are executed by a processor or controller, for example by a processor, they cause the processor to perform the dialogue emotion analysis method of the above embodiments, for example steps S100 to S200, steps S210 to S240, steps S241 to S242 and steps S2411 to S2414 described above.
The embodiments of the present application can perform emotion classification by combining the bimodal information of the text data and the speech data of a multi-party dialogue, capturing the respective sequential features of the two modalities. The target dialogue is modeled with a graph structure, and a graph attention network captures globally important features, so that the classification model attends to the semantic features of the two modalities as well as to their sequential features. For multi-party dialogue, this strengthens the effectiveness of multi-modal learning and improves the accuracy of emotion classification.
A person of ordinary skill in the art can understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware and appropriate combinations thereof. Some or all of the physical components may be implemented as software executed by a processor such as a central processing unit, a digital signal processor or a microprocessor, as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program units or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program units or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The above is a detailed description of implementations of the present application, but the present application is not limited to the above embodiments. Those skilled in the art may make various equivalent modifications or replacements without departing from the spirit of the present application, and such equivalent modifications or replacements all fall within the scope defined by the claims of the present application.

Claims (11)

  1. A dialogue emotion analysis method, comprising:
    acquiring text data and speech data of a target dialogue; and
    inputting the text data and the speech data into a classification model based on a graph structure for emotion classification to obtain emotion classification information, wherein the graph structure comprises a plurality of nodes and corresponding node features, the nodes correspond one-to-one to utterances in the target dialogue, and the node features comprise text features of the text data and speech features of the speech data.
  2. The dialogue emotion analysis method according to claim 1, wherein the inputting the text data and the speech data into a classification model based on a graph structure for emotion classification to obtain emotion classification information comprises:
    for each utterance of the target dialogue, performing feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance, and performing feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance;
    for each utterance of the target dialogue, performing feature fusion on the plurality of text features and the plurality of speech features to obtain a fusion feature corresponding to the utterance;
    constructing the graph structure with the utterances as the nodes and the fusion features as the node features; and
    performing emotion classification on the target dialogue according to the graph structure to obtain the emotion classification information.
  3. The dialogue emotion analysis method according to claim 2, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance comprises:
    for each utterance of the target dialogue, mapping each word of the text data to a word vector to obtain a word vector matrix, and performing utterance-level feature extraction on the word vector matrix to obtain a first utterance-level feature; and
    for each utterance of the target dialogue, performing utterance-level feature extraction on the speech data to obtain a second utterance-level feature.
  4. The dialogue emotion analysis method according to claim 3, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance further comprises:
    for each utterance of the target dialogue, inputting the first utterance-level features into an encoding network in time order for encoding, to obtain a first hidden layer state corresponding to the text data; and
    for each utterance of the target dialogue, inputting the second utterance-level features into an encoding network in time order for encoding, to obtain a second hidden layer state corresponding to the speech data.
  5. The dialogue emotion analysis method according to claim 4, wherein the performing, for each utterance of the target dialogue, feature extraction on the text data to obtain a plurality of the text features corresponding to the utterance and feature extraction on the speech data to obtain a plurality of the speech features corresponding to the utterance further comprises:
    for each utterance of the target dialogue, performing probability prediction according to the first hidden layer state to obtain a first predicted probability distribution corresponding to the text data; and
    for each utterance of the target dialogue, performing probability prediction according to the second hidden layer state to obtain a second predicted probability distribution corresponding to the speech data.
  6. The dialogue emotion analysis method according to claim 5, wherein the performing, for each utterance of the target dialogue, feature fusion on the plurality of text features and the plurality of speech features to obtain the fusion feature corresponding to the utterance comprises:
    for each utterance of the target dialogue, performing vector concatenation on the first utterance-level feature, the second utterance-level feature, the first hidden layer state, the second hidden layer state, the first predicted probability distribution and the second predicted probability distribution to obtain the fusion feature corresponding to the utterance.
  7. The dialogue emotion analysis method according to claim 2, wherein the constructing the graph structure with the utterances as the nodes and the fusion features as the node features comprises:
    taking each utterance of the target dialogue as a node of the graph structure;
    taking the fusion feature corresponding to an utterance as the node feature of the node corresponding to the utterance; and
    connecting the nodes pairwise to obtain a complete undirected graph as the graph structure.
  8. The dialogue emotion analysis method according to claim 2 or 7, wherein the performing emotion classification on the target dialogue according to the graph structure comprises:
    updating the node features of the graph structure based on an attention mechanism to obtain new node features fusing global feature information; and
    performing emotion classification on the target dialogue according to the new node features.
  9. The dialogue emotion analysis method according to claim 8, wherein the updating the node features of the graph structure based on an attention mechanism to obtain new node features fusing global feature information comprises:
    performing a linear mapping on all the nodes to obtain new nodes;
    determining a central node from the new nodes, and determining adjacent nodes adjacent to the central node;
    computing first attention weights from the central node to each of the adjacent nodes; and
    performing a weighted summation on all the first attention weights to obtain a second attention weight, and obtaining, according to the second attention weight, the new node features fusing global feature information.
  10. A dialogue emotion analysis apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the dialogue emotion analysis method according to any one of claims 1 to 9.
  11. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, cause the processor to perform the dialogue emotion analysis method according to any one of claims 1 to 9.
PCT/CN2022/129655 2021-11-03 2022-11-03 Conversation sentiment analysis method and apparatus, and computer-readable storage medium WO2023078370A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111295920.7 2021-11-03
CN202111295920.7A CN116090474A (en) 2021-11-03 2021-11-03 Dialogue emotion analysis method, dialogue emotion analysis device and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2023078370A1 true WO2023078370A1 (en) 2023-05-11

Family

ID=86205014

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129655 WO2023078370A1 (en) 2021-11-03 2022-11-03 Conversation sentiment analysis method and apparatus, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN116090474A (en)
WO (1) WO2023078370A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN117409780A (en) * 2023-12-14 2024-01-16 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270921A1 (en) * 2016-03-15 2017-09-21 SESTEK Ses ve Iletisim Bilgisayar Tekn. San. Ve Tic. A.S. Dialog management system
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN113254648A (en) * 2021-06-22 2021-08-13 暨南大学 Text emotion analysis method based on multilevel graph pooling

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170270921A1 (en) * 2016-03-15 2017-09-21 SESTEK Ses ve Iletisim Bilgisayar Tekn. San. Ve Tic. A.S. Dialog management system
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN108877801A (en) * 2018-06-14 2018-11-23 南京云思创智信息科技有限公司 More wheel dialog semantics based on multi-modal Emotion identification system understand subsystem
CN112348075A (en) * 2020-11-02 2021-02-09 大连理工大学 Multi-mode emotion recognition method based on contextual attention neural network
CN113254648A (en) * 2021-06-22 2021-08-13 暨南大学 Text emotion analysis method based on multilevel graph pooling

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116361442A (en) * 2023-06-02 2023-06-30 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN116361442B (en) * 2023-06-02 2023-10-17 国网浙江宁波市鄞州区供电有限公司 Business hall data analysis method and system based on artificial intelligence
CN117409780A (en) * 2023-12-14 2024-01-16 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system
CN117409780B (en) * 2023-12-14 2024-02-27 浙江宇宙奇点科技有限公司 AI digital human voice interaction method and system

Also Published As

Publication number Publication date
CN116090474A (en) 2023-05-09

Similar Documents

Publication Publication Date Title
US11043205B1 (en) Scoring of natural language processing hypotheses
US10347244B2 (en) Dialogue system incorporating unique speech to text conversion method for meaningful dialogue response
KR102289917B1 (en) Method for processing dialogue using dialogue act information and Apparatus thereof
WO2023078370A1 (en) Conversation sentiment analysis method and apparatus, and computer-readable storage medium
US11450332B2 (en) Audio conversion learning device, audio conversion device, method, and program
US11823678B2 (en) Proactive command framework
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
WO2019118254A1 (en) Chatbot integrating derived user intent
US11081104B1 (en) Contextual natural language processing
US20240153489A1 (en) Data driven dialog management
CN109903750B (en) Voice recognition method and device
US10963819B1 (en) Goal-oriented dialog systems and methods
Masumura et al. Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models
JP2007256342A (en) Clustering system, clustering method, clustering program and attribute estimation system using clustering program and clustering system
CN111081230A (en) Speech recognition method and apparatus
CN115329779A (en) Multi-person conversation emotion recognition method
Chen et al. Sequence-to-sequence modelling for categorical speech emotion recognition using recurrent neural network
CN112669845A (en) Method and device for correcting voice recognition result, electronic equipment and storage medium
CN115497465A (en) Voice interaction method and device, electronic equipment and storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
CN113853651A (en) Apparatus and method for speech-emotion recognition using quantized emotional states
CN116361442B (en) Business hall data analysis method and system based on artificial intelligence
JP2017167938A (en) Learning device, learning method, and program
KR20230120790A (en) Speech Recognition Healthcare Service Using Variable Language Model
JP7420211B2 (en) Emotion recognition device, emotion recognition model learning device, methods thereof, and programs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889391

Country of ref document: EP

Kind code of ref document: A1